Abstract

In this article, we derive concentration inequalities for the cross-validation estimate of the
generalization error for subagged estimators, both for classification and regression. General loss
functions and classes of predictors with both finite and infinite VC-dimension are considered. We
slightly generalize the formalism introduced by [DUD03] to cover a large variety of cross-validation
procedures including leave-one-out cross-validation, k-fold cross-validation, hold-out
cross-validation (or split sample), and leave-v-out cross-validation.

An interesting consequence is that the probability upper bound is the minimum of a Hoeffding-type
bound and a Vapnik-type bound, and thus is smaller than 1 even for small learning sets. Finally, we
give a simple rule on how to subbag the predictor.

Keywords: Cross-validation, generalization error, concentration inequality, optimal splitting,
resampling.
1 Introduction and motivation

One of the main issues of pattern recognition is to create a predictor (a regressor or a classifier)
which takes observable inputs in order to predict the unknown nature of an output. Typical
applications range from predicting the digits of a digitized zip code to predicting the chance of
survival from clinical measurements. Formally, a predictor $\phi$ is a measurable map from some
measurable space $\mathcal{X}$ to some measurable space $\mathcal{Y}$. When $\mathcal{Y}$ is a
countable set (respectively $\mathbb{R}^n$), the predictor is called a classifier (respectively a
regressor). The strategy of Machine Learning consists in building a learning algorithm $\Phi$ from
both a set of examples and a class of methods. Typical classes of methods are empirical risk
minimization or $k$-nearest neighbors rules. The set of examples consists of the measurement of $n$
observations $(x_i, y_i)_{1 \le i \le n}$. Thus, formally, $\Phi$ is a measurable map from
$\mathcal{X} \times \bigcup_n (\mathcal{X} \times \mathcal{Y})^n$ to $\mathcal{Y}$.
One of the main issues of Statistical Learning is to analyse the performance of a learning machine
in a probabilistic setting: $(x_i, y_i)_{1 \le i \le n}$ are supposed to be observations from $n$
independent and identically distributed (i.i.d.) random variables $(X_i, Y_i)_{1 \le i \le n}$ with
distribution $P$. $(X_i, Y_i)_{1 \le i \le n}$ is denoted $\mathcal{D}_n$ in the following and called
the learning set. In order to analyse the performance, it is usual to consider the conditional risk
of a machine learning $\Phi$, denoted $R_n$ and also called the generalization error. It is defined
by the conditional expectation of $L(Y, \Phi(X, \mathcal{D}_n))$ given $\mathcal{D}_n$, where
$(X, Y) \sim P$ is a random variable independent of $\mathcal{D}_n$, i.e.
$R_n := \mathbb{E}_{X,Y}[L(Y, \Phi(X, \mathcal{D}_n)) \mid \mathcal{D}_n]$ with $L$ a cost function
from $\mathcal{Y}^2 \to \mathbb{R}_+$. Notice that $R_n$ is a random variable measurable with respect
to $\mathcal{D}_n$.

Bagging, to be defined formally below, is a procedure building an estimator by a resample-and-combine
technique. Bagging (bootstrap aggregating) was introduced by [?] to reduce the variance of a
predictor. From an original estimator, a bagged regressor is produced by averaging several replicates
trained on bootstrap samples; a bagged classifier is produced by majority vote. It is one
of the recent and successful computationally intensive methods for improving unstable estimation or
classification schemes. It is extremely useful for large, high-dimensional data set problems where
finding a good model or classifier in one step is impossible because of the complexity and scale of
the problem. Regarding prediction error, the method often compares favorably with the original
predictor, and also, in situations with substantial noise, with other ensemble methods such as
boosting or randomization. Hence it is very important to understand the reasons for its successes,
and also for its occasional failures. However, even if it has attracted much attention and is
frequently applied, important questions remain unanswered theoretically. In this article, we study a
variant of bagging called subagging (subsample aggregating) that has appeared in [?] and [?]. It is
more accessible for analysis and also has substantial computational advantages. The subagged
estimator will be denoted by $\Phi^B(X, \mathcal{D}_n)$ or $\Phi_n^B(X)$ in the following.

Important questions are: Is the generalization error of a subagged predictor lower than that of the
original predictor, i.e. $R_n(\Phi^B) \le R_n(\Phi)$? The distribution $P$ of the generating process
being unknown, can we estimate the generalization error of a subagged predictor? Our strategy is the
following: after briefly emphasizing the difficulty of providing a general answer to the first
question, we will concentrate on the second question. To estimate the generalization error of a
subagged predictor, we propose to use an adapted cross-validation estimator denoted by
$\hat{R}_{CV}(\Phi)$.

[?] aggregates regression trees to build random forests and calls this process bagging. [?] prove
that the bagged functional is always smooth in some sense. [?] also show that bagging can increase
both bias and variance. [?] prove that (in the limit of infinite samples) bagging reduces the
variance of non-linear components of the Taylor decomposition while leaving the linear part
unaffected. [?] consider non-differentiable and discontinuous predictors and concentrate on the
asymptotic smoothing effect of bagging in the neighborhood of discontinuities of decision surfaces.
[?] brings a new argument to explain the bagging effect: bagging's improvements/deteriorations are
explained by the goodness/badness of highly influential examples. [?] prove the effect of bagging on
the stability of a learning method and derive non-asymptotic bounds for the approximation error of
the bagging predictor. An interesting asymptotic result was derived in [?]: asymptotically, bagging
of weak predictors can produce a strong learner, namely the Bayes classifier. However, a general
answer to the non-asymptotic question $R_n(\Phi^B) \le R_n(\Phi)$? seems hard to reach in a general
framework. Using the Gauss-Markov theorem, [?] shows that both the bagged and unbagged predictors
are unbiased, thus the variance of the unbagged predictor is lower than the variance of the bagged
one. [?] exhibit general quadratic statistics for which the bagged predictor increases both variance
and bias. Thus, we propose to estimate directly the generalization error of the subagged predictor
by an adapted cross-validation procedure. The latter is inspired by [?], who proposed to use the
left-out examples of the bootstrap samples.

In the general setting, the cross-validation procedures include leave-one-out cross-validation,
k-fold cross-validation, hold-out cross-validation (or split sample), and leave-v-out
cross-validation (or Monte Carlo cross-validation or bootstrap cross-validation). With the exception
of [BUR89], theoretical investigations of multifold cross-validation procedures first concentrated
on linear models ([LI87]; [SHAO93]; [ZHA93]). Results of [DGL96] and [GYO02] are discussed in
Section 3. The first finite sample results are due to Wagner and Devroye [DEWA79] and concern
k-local rules algorithms under leave-one-out and hold-out cross-validation. More recently, [HOL96],
[HOL96bis] derived finite sample results for leave-v-out cross-validation, k-fold cross-validation,
and leave-one-out cross-validation for ERM over a class of predictors with finite VC-dimension in
the realisable case (the generalization error is equal to zero). [BKL99] have emphasized when k-fold
can beat leave-v-out cross-validation in the particular case of the k-fold predictor. [KR99] has
extended such results to the case of stable algorithms for the leave-one-out cross-validation
procedure. [KEA95] also derived results for hold-out cross-validation for ERM, but their arguments
rely on the traditional notion of VC-dimension. In the particular case of ERM over a class of
predictors with finite VC-dimension but with general cross-validation procedures, [?] derived
probability upper bounds. [?] derived upper bounds for the general cross-validation estimate of the
generalization error of stable predictors that do not make reference to VC-dimension. However, the
bounds obtained are called "sanity check bounds" since they are not better than the classical
Vapnik-Chervonenkis bounds.

We introduce our main result for symmetric cross-validation procedures (i.e. the probability for
an observation to be in the test set is independent of its index) in the special case of empirical
risk minimization (ERM). We divide the learning sample into two samples: the training sample and the
test sample, to be defined below. We denote by $p_n$ the percentage of elements in the test sample.
Suppose that $\mathcal{H}$ holds, to be defined below. Suppose also that $\phi_n$ is an empirical
risk minimizer. Then, we have for all $\varepsilon > 0$,

$\Pr(R_n(\Phi_n^B) - \hat{R}^{Out}_{CV} \ge \varepsilon) \le \min(B_{ERM}(n, p_n, \varepsilon), V_{ERM}(n, p_n, \varepsilon)) < 1,$

with

• $B_{ERM}(n, p_n, \varepsilon) = \min\big((2np_n + 1)^{4V_C/p_n} \exp(-n\varepsilon^2),\ (2n(1 - p_n) + 1)^{4V_C/(1 - p_n)} \exp(-n\varepsilon^2/9)\big)$

• $V_{ERM}(n, p_n, \varepsilon) = \exp(-2np_n\varepsilon^2)$.

The term $B_{ERM}(n, p_n, \varepsilon)$ is a Vapnik-Chervonenkis-type bound whose polynomial factors
are controlled by the sizes of the test sample $np_n$ and of the training sample $n(1 - p_n)$,
whereas the term $V_{ERM}(n, p_n, \varepsilon)$ is a Hoeffding-type term controlled by the size of
the test sample $np_n$. This bound can be interpreted as a quantitative answer to a trade-off issue.
As the percentage of observations in the test sample $p_n$ increases, the term
$V_{ERM}(n, p_n, \varepsilon)$ decreases but the term $B_{ERM}(n, p_n, \varepsilon)$ increases.
Other similar bounds are derived for infinite VC-dimension machine learning in the stability
framework.
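To get a feeling for the trade-off, the bound can be evaluated numerically. The following is only a
minimal sketch, assuming the exponents $4V_C/p_n$ and $4V_C/(1 - p_n)$ in $B_{ERM}$; the function
names `log_b_erm`, `log_v_erm` and `erm_bound` are ours, not part of the paper. Everything is
computed in log space because the polynomial factors overflow quickly.

```python
import math


def log_b_erm(n, p, eps, vc):
    """Log of the Vapnik-Chervonenkis-type term B_ERM(n, p_n, eps), for 0 < p < 1."""
    # (2 n p + 1)^(4 V_C / p) * exp(-n eps^2), in log form
    test_part = (4 * vc / p) * math.log(2 * n * p + 1) - n * eps ** 2
    # (2 n (1 - p) + 1)^(4 V_C / (1 - p)) * exp(-n eps^2 / 9), in log form
    train_part = (4 * vc / (1 - p)) * math.log(2 * n * (1 - p) + 1) - n * eps ** 2 / 9
    return min(test_part, train_part)


def log_v_erm(n, p, eps):
    """Log of the Hoeffding-type term V_ERM(n, p_n, eps) = exp(-2 n p_n eps^2)."""
    return -2 * n * p * eps ** 2


def erm_bound(n, p, eps, vc):
    """min(B_ERM, V_ERM); the minimum is taken on the logs, then exponentiated."""
    return math.exp(min(log_b_erm(n, p, eps, vc), log_v_erm(n, p, eps)))
```

For a small learning set (for instance `erm_bound(50, 0.2, 0.1, 5)`) the Hoeffding-type term is the
active one and keeps the bound below 1, while for very large `n` and small `p` the
Vapnik-Chervonenkis-type term takes over.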

The main interest of the previous results is the following:

• our bounds are valid for machine learning with both finite and infinite VC-dimension. In the
latter case, it is sufficient that the machine learning satisfies some stability property as
introduced in chapter 2. As a motivation, we quote the following list of algorithms satisfying
stability properties: regularization networks, ERM, k-nearest rules, boosting.
• our bounds are strictly less than 1 for any size of learning set. Thus they are also valid for
small samples.



Using these probability bounds, we can then deduce a bound on the expectation of the difference
between the generalization error and the cross-validation estimate (see Theorem 20).

Finally, we define a splitting rule on how to choose the percentage of elements $p_n^*$ in the test
sample in order to get both a low generalization error and a good approximation rate. We derive
for this optimal choice of $p_n^*$ a bound of the form



Pr(Rn($Z'*) - Rgtf(p*) >e) = O n ((n + lf Vc cxp(-2n( £ - l^W^ 2 ^M^Jnf j{\ - cxp(-2e 2 )). 



The paper is organized as follows. We have detailed the main cross-validation procedures and
summarized the previous results for the estimation of the generalization error. In Section 2, we
introduce the main notations and definitions. Finally, in Section 3, we present our results in terms
of concentration inequalities.

2 Main notations 

In the following, we follow the notations of cross-validation introduced in [?].

We will consider the following shorter notations inspired by the literature on empirical processes.
In the sequel, we will denote $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$, and
$(Z_i)_{1 \le i \le n} := ((X_i, Y_i))_{1 \le i \le n}$ the learning set. For a given loss function
$L$ and a given class of predictors $\mathcal{G}$, we define a new class $\mathcal{F}$ of functions
from $\mathcal{Z}$ to $\mathbb{R}_+$ by
$\mathcal{F} := \{\psi \in \mathbb{R}_+^{\mathcal{Z}} \mid \psi(Z) = L(Y, \phi(X)), \phi \in \mathcal{G}\}$.
For a machine learning $\Phi$, we have the natural definition
$\Psi(Z, \mathcal{D}_n) = L(Y, \Phi(X, \mathcal{D}_n))$. With these notations, the conditional risk
$R_n$ is the expectation of $\Psi(Z, \mathcal{D}_n)$ with respect to $P$ conditionally on
$\mathcal{D}_n$: $R_n := \mathbb{E}_Z[\Psi(Z, \mathcal{D}_n) \mid \mathcal{D}_n]$ with $Z \sim P$
independent of $\mathcal{D}_n$. In the following, if there is no ambiguity, we will also allow the
notation $\psi(X, \mathcal{D}_n)$ instead of $\Psi(X, \mathcal{D}_n)$.

To define the accurate type of cross-validation procedure, we introduce binary vectors. Let
$V_n = (V_{n,i})_{1 \le i \le n}$ be a vector of size $n$. $V_n$ is a binary vector if for all
$1 \le i \le n$, $V_{n,i} \in \{0, 1\}$ and $\sum_{i=1}^n V_{n,i} \neq 0$. Consequently, we can
define the subsample associated with it:
$\mathcal{D}_{V_n} := \{Z_i \in \mathcal{D}_n \mid V_{n,i} = 1, 1 \le i \le n\}$. We define a
weighted empirical measure on $\mathcal{Z}$

$P_{n,V_n} := \frac{1}{\sum_{i=1}^n V_{n,i}} \sum_{i=1}^n V_{n,i}\, \delta_{Z_i}$

with $\delta_{Z_i}$ the Dirac measure at $\{Z_i\}$. We also define a weighted empirical error
$P_{n,V_n}\psi$, where $P_{n,V_n}\psi$ stands for the usual notation of the expectation of $\psi$
with respect to $P_{n,V_n}$. For $P_{n,1_n}$, with $1_n$ the binary vector of size $n$ with 1 at
every coordinate, we will use the traditional notation $P_n$. For a predictor trained on a
subsample, we define

$\psi_{V_n}(\cdot) := \Psi(\cdot, \mathcal{D}_{V_n}).$

With the previous notations, notice that the predictor trained on the learning set
$\psi(\cdot, \mathcal{D}_n)$ can be denoted by $\psi_{1_n}(\cdot)$. We will allow the simpler
notation $\psi_n(\cdot)$. The learning set is divided into two disjoint sets: the training set of
size $n(1 - p_n)$ and the test set of size $np_n$, where $p_n$ is the percentage of elements in the
test set. To represent the training set, we define $V_n^{tr}$, a random binary vector of size $n$
independent of $\mathcal{D}_n$. $V_n^{tr}$ is called the training vector. We define the test vector
by $V_n^{ts} := 1_n - V_n^{tr}$ to represent the test set.
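The binary-vector notation translates directly into code. The sketch below is only illustrative:
the helper names `subsample`, `weighted_empirical_mean` and `complement_vector` are ours, and the
normalization by $\sum_i V_{n,i}$ is an assumption about the weighted empirical measure (chosen so
that the all-ones vector gives back the usual empirical mean $P_n\psi$).

```python
def subsample(data, v):
    """Return D_{V_n}: the observations Z_i whose coordinate V_{n,i} equals 1."""
    if len(data) != len(v) or any(b not in (0, 1) for b in v) or sum(v) == 0:
        raise ValueError("v must be a nonzero binary vector with one entry per observation")
    return [z for z, b in zip(data, v) if b == 1]


def weighted_empirical_mean(psi, data, v):
    """P_{n,V_n} psi: average of psi over the selected observations (assumed normalization)."""
    selected = subsample(data, v)
    return sum(psi(z) for z in selected) / len(selected)


def complement_vector(v_tr):
    """Test vector V^ts := 1_n - V^tr."""
    return [1 - b for b in v_tr]
```

For example, with four observations and training vector `[1, 1, 0, 1]`, the test vector is
`[0, 0, 1, 0]` and the weighted empirical mean only averages over the first, second and fourth
observations.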

The distribution of $V_n^{tr}$ characterizes all the subagging procedures described in the previous
section. Using our notations, we can now define the subagged predictor.

Definition 1 (Subagged regressor) The subagged predictor built from $\phi_n$, denoted $\Phi_n^B$,
is defined by:

$\Phi_n^B(\cdot) := \mathbb{E}_{V_n^{tr}}\, \phi_{V_n^{tr}}(\cdot).$

In the case of classifiers, the bagging rule corresponds to the vote by majority. We suppose in this
case that $\mathcal{Y} = \{1, \ldots, M\}$.

Definition 2 (Subagged classifier) The subagged classifier built from $\phi_n$, denoted
$\Phi_n^B$, is defined by:

$\Phi_n^B(X) := \arg\min_{k \in \{1, \ldots, M\}} \mathbb{E}_{V_n^{tr}}\, L(k, \Phi(X, \mathcal{D}_{V_n^{tr}})).$
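Definitions 1 and 2 can be sketched for a training vector with finitely many values. This is a
minimal illustration, not the paper's implementation: the base learners `fit_mean` and
`fit_majority` are hypothetical toy algorithms, and `vectors`/`probs` stand for the support and the
probabilities of the distribution of $V_n^{tr}$.

```python
def fit_mean(sample):
    """Toy regressor (hypothetical): always predicts the mean response of its sample."""
    mean_y = sum(y for _, y in sample) / len(sample)
    return lambda x: mean_y


def fit_majority(sample):
    """Toy classifier (hypothetical): always predicts the most frequent label of its sample."""
    ys = [y for _, y in sample]
    label = max(sorted(set(ys)), key=ys.count)  # ties go to the smallest label
    return lambda x: label


def _fit_all(fit, data, vectors):
    # One predictor phi_{V^tr} per training vector in the support.
    return [fit([z for z, b in zip(data, v) if b == 1]) for v in vectors]


def subagged_regressor(fit, data, vectors, probs):
    """x -> E_{V^tr} phi_{V^tr}(x): probability-weighted average of the trained predictors."""
    fitted = _fit_all(fit, data, vectors)
    return lambda x: sum(q * f(x) for q, f in zip(probs, fitted))


def subagged_classifier(fit, data, vectors, probs, labels, loss):
    """x -> argmin_k E_{V^tr} L(k, phi_{V^tr}(x)) over the candidate labels."""
    fitted = _fit_all(fit, data, vectors)

    def predict(x):
        return min(labels, key=lambda k: sum(q * loss(k, f(x)) for q, f in zip(probs, fitted)))

    return predict
```

With the 0-1 loss, the arg min in `subagged_classifier` is exactly a probability-weighted majority
vote among the trained classifiers.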

We can now define the cross-validation estimator. 

Definition 3 (Cross-validated subagged estimator) Cross-validated subagged estimates of the risk
of $\Phi_n^B$ can be defined in two different ways by:

$\hat{R}^{Out}_{CV}(\Phi_n^B) := \mathbb{E}_{V_n^{tr}}\, P_{n,V_n^{ts}}(\psi_{V_n^{tr}})$

and 

Remark 4 Recall that $\mathbb{E}_{V_n^{tr}} P_{n,V_n^{ts}}(\psi_{V_n^{tr}})$ is the conditional
expectation of $P_{n,V_n^{ts}}(\psi_{V_n^{tr}})$ with respect to the random vector $V_n^{tr}$ given
$\mathcal{D}_n$.

Remark 5 The cross-validated subagged estimate differs from the usual cross-validation estimate
$\hat{R}_{CV}(\phi_n)$, which is equal to $\mathbb{E}_{U_n^{tr}} P_{n,U_n^{ts}}(\psi_{U_n^{tr}})$
with $U_n^{tr}$ the training vector as defined in chapter 1.
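The quantity $\mathbb{E}_{V_n^{tr}} P_{n,V_n^{ts}}(\psi_{V_n^{tr}})$ can be computed exactly when
the training vector takes finitely many values. The following is a minimal sketch under that
assumption; `cv_out_estimate` and the toy learner `fit_mean` are hypothetical names introduced for
illustration only.

```python
def fit_mean(sample):
    """Toy regressor (hypothetical): always predicts the mean response of its sample."""
    mean_y = sum(y for _, y in sample) / len(sample)
    return lambda x: mean_y


def cv_out_estimate(fit, loss, data, vectors, probs):
    """E_{V^tr} P_{n,V^ts} psi_{V^tr}: expected held-out loss over the training vectors."""
    total = 0.0
    for v, q in zip(vectors, probs):
        predictor = fit([z for z, b in zip(data, v) if b == 1])   # trained on D_{V^tr}
        held_out = [z for z, b in zip(data, v) if b == 0]         # test set D_{V^ts}
        test_error = sum(loss(y, predictor(x)) for x, y in held_out) / len(held_out)
        total += q * test_error
    return total
```

On the three observations with responses 1, 2, 3, the leave-one-out training vectors and the squared
loss, the held-out errors are 2.25, 0 and 2.25, so the estimate is 1.5.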

We will give here a few examples of distributions of $V_n^{tr}$ to show that we retrieve the
subagging procedures described previously. Suppose $n/k$ is an integer. The k-fold subagging
procedure divides the data into $k$ equally sized folds. It then produces a predictor by training
on $k - 1$ folds. This is repeated for each fold, and the trained predictors are averaged to form
the subagged predictor.

Example 6 (k-fold cross-validation)

$\Pr\big(V_n^{tr} = (\underbrace{0, \ldots, 0}_{n/k \text{ observations}}, \underbrace{1, \ldots, 1}_{n(1 - 1/k) \text{ observations}})\big) = \frac{1}{k}$

$\Pr\big(V_n^{tr} = (\underbrace{1, \ldots, 1}_{n/k \text{ observations}}, \underbrace{0, \ldots, 0}_{n/k \text{ observations}}, \underbrace{1, \ldots, 1}_{n(1 - 2/k) \text{ observations}})\big) = \frac{1}{k}$

$\ldots$

$\Pr\big(V_n^{tr} = (\underbrace{1, \ldots, 1}_{n(1 - 1/k) \text{ observations}}, \underbrace{0, \ldots, 0}_{n/k \text{ observations}})\big) = \frac{1}{k}.$

We provide another popular example: the leave-one-out cross-validation. In leave-one-out
cross-validation, a single sample of size $n$ is used. Each member of the sample in turn is removed,
the full modeling method is applied to the remaining $n - 1$ members, and the fitted model is
applied to the held-back member.

Example 7 (leave-one-out cross-validation)

$\Pr(V_n^{tr} = (0, 1, \ldots, 1)) = \frac{1}{n}$

$\Pr(V_n^{tr} = (1, 0, 1, \ldots, 1)) = \frac{1}{n}$

$\ldots$

$\Pr(V_n^{tr} = (1, \ldots, 1, 0)) = \frac{1}{n}.$
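The two distributions of the training vector above are easy to enumerate. The sketch below lists
the support and the probabilities for k-fold with contiguous folds (leave-one-out being the case
$k = n$); the function names are ours and only serve the illustration.

```python
def kfold_training_vectors(n, k):
    """Support and probabilities of V^tr for k-fold: zeros on one fold, ones elsewhere."""
    if k < 2 or n % k != 0:
        raise ValueError("k must be at least 2 and n/k must be an integer")
    fold = n // k
    vectors = [
        [0 if j * fold <= i < (j + 1) * fold else 1 for i in range(n)]
        for j in range(k)
    ]
    return vectors, [1.0 / k] * k


def leave_one_out_training_vectors(n):
    """Leave-one-out is k-fold with k = n: each observation is held out with probability 1/n."""
    return kfold_training_vectors(n, n)


def inclusion_probabilities(vectors, probs):
    """Pr(V^tr_{n,i} = 1) for each index i; constant in i when the procedure is symmetric."""
    n = len(vectors[0])
    return [sum(q * v[i] for v, q in zip(vectors, probs)) for i in range(n)]
```

For both procedures the inclusion probabilities do not depend on the index, which is the symmetry
property used in the results below.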

3 Results for the cross-validated subagged regressor 

3.1 VC Framework 

3.1.1 Notations and definition 

We denote by $R_{opt}$ the minimal generalization error attained among the class of predictors
$\mathcal{C}$, $R_{opt} = \inf_{\phi \in \mathcal{C}} R(\phi)$. In the sequel, we suppose that
$\phi_n$ belongs to some $\mathcal{C}$. Notice that $R_{opt}$ is a parameter of the unknown
distribution $P_{(X,Y)}$ whereas $R_n$ is a random variable.

At last, recall the definitions of:

Definition 8 (Shatter coefficients) Let $\mathcal{A}$ be a collection of measurable sets. For
$(z_1, \ldots, z_n) \in (\mathbb{R}^d)^n$, let $N_{\mathcal{A}}(z_1, \ldots, z_n)$ be the number of
different sets in

$\{\{z_1, \ldots, z_n\} \cap A;\ A \in \mathcal{A}\}.$

The n-shatter coefficient of $\mathcal{A}$ is

$S(\mathcal{A}, n) = \max_{(z_1, \ldots, z_n) \in (\mathbb{R}^d)^n} N_{\mathcal{A}}(z_1, \ldots, z_n).$

That is, the shatter coefficient is the maximal number of different subsets of $n$ points that can
be picked out by the class of sets $\mathcal{A}$.

and

Definition 9 (VC dimension) Let $\mathcal{A}$ be a collection of sets with $|\mathcal{A}| \ge 2$.
The largest integer $k \ge 1$ for which $S(\mathcal{A}, k) = 2^k$ is denoted by $V_C$, and it is
called the Vapnik-Chervonenkis dimension (or VC dimension) of the class $\mathcal{A}$. If
$S(\mathcal{A}, n) = 2^n$ for all $n$, then by definition $V_C = \infty$.

A class of predictors $\mathcal{C}$ is said to have a finite VC-dimension $V_C$ if the dimension of
the collection of sets $\{A_{\phi,t} : \phi \in \mathcal{C}, t \in [0, 1]\}$ is equal to $V_C$,
where $A_{\phi,t} = \{(x, y) : L(y, \phi(x)) > t\}$.
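Shatter coefficients can be computed by brute force for simple classes of sets on the real line.
The sketch below is an illustration only; the two classes (closed half-lines $(-\infty, a]$ and
closed intervals $[a, b]$) and all function names are our choices, not taken from the paper. For
these two classes the count does not depend on which $n$ distinct points are used, so evaluating it
on one configuration gives $S(\mathcal{A}, n)$.

```python
def count_picked_subsets(points, members):
    """N_A(z_1, ..., z_n): number of distinct subsets of `points` cut out by the sets."""
    return len({frozenset(z for z in points if member(z)) for member in members})


def half_lines(points):
    """Membership tests for (-inf, a]; thresholds at the points and below them all suffice."""
    thresholds = [min(points) - 1] + sorted(points)
    return [lambda z, a=a: z <= a for a in thresholds]


def intervals(points):
    """Membership tests for [a, b] with endpoints at the points, plus one empty set."""
    pts = sorted(points)
    members = [lambda z: False]
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            members.append(lambda z, lo=pts[i], hi=pts[j]: lo <= z <= hi)
    return members
```

On $n$ distinct points, half-lines pick out $n + 1$ subsets and intervals pick out
$n(n + 1)/2 + 1$ subsets. Comparing with $2^n$ gives VC dimension 1 for half-lines and 2 for
intervals.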

3.1.2 Results 

In the sequel, we suppose that the cross-validation is symmetric (i.e. $\Pr(V^{tr}_{n,i} = 1)$ is
independent of $i$), that the training sample and the test sample are disjoint, and that the numbers
of observations in the training sample and in the test sample are constant and equal to
$n(1 - p_n)$ and $np_n$ respectively. Moreover, we suppose also that $\phi_n$ belongs to a class of
predictors with finite VC-dimension. Suppose also that $L$ is bounded in the following way:
$L(Y, \phi(X)) \le C(h(Y, \phi(X)))$ with $C$ a convex function (bounded itself by 1 on the support
of $h(Y, \phi_{V_n^{tr}}(X))$, for simplicity), and $h$ such that for any $0 < \lambda < 1$, we have
$h(y, \lambda\phi(x_1) + (1 - \lambda)\phi(x_2)) \le \lambda h(y, \phi(x_1)) + (1 - \lambda) h(y, \phi(x_2))$.
We will also suppose that the predictors are symmetric with respect to the training sample, i.e. the
predictor does not depend on the order of the observations in $\mathcal{D}_n$. We denote these
hypotheses by $\mathcal{H}$.
Remark 10 Typical upper-bounding convex cost functions are: the hinge loss $C(x) = (1 + x)_+$, the
exponential loss $C(x) = e^x$, the logit loss $C(x) = \log_2(1 + e^x)$.
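The three cost functions of Remark 10 can be checked numerically. The sketch below assumes, for
illustration only, that the loss being dominated is the indicator $1\{x \ge 0\}$ (as happens for
instance with $h(y, \phi(x)) = -y\phi(x)$ and labels in $\{-1, +1\}$); this choice and the function
names are ours.

```python
import math


def hinge(x):
    """Hinge loss C(x) = (1 + x)_+."""
    return max(0.0, 1.0 + x)


def exponential(x):
    """Exponential loss C(x) = e^x."""
    return math.exp(x)


def logit(x):
    """Logit loss C(x) = log_2(1 + e^x)."""
    return math.log2(1.0 + math.exp(x))


def zero_one(x):
    """Indicator 1{x >= 0}, the loss assumed to be dominated by the surrogates."""
    return 1.0 if x >= 0 else 0.0


def dominates_and_is_midpoint_convex(cost, grid):
    """Check C >= 1{x >= 0} and C((a + b) / 2) <= (C(a) + C(b)) / 2 on a finite grid."""
    dominates = all(cost(x) >= zero_one(x) for x in grid)
    convex = all(
        cost((a + b) / 2.0) <= (cost(a) + cost(b)) / 2.0 + 1e-12
        for a in grid
        for b in grid
    )
    return dominates and convex
```

All three functions equal 1 at $x = 0$ and are nondecreasing, which is why they dominate the
indicator. The grid check is of course only a sanity check, not a proof of convexity.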

We will show upper bounds of the kind
$\Pr(R_n(\Phi_n^B) - \hat{R}^{Out}_{CV} \ge \varepsilon) \le \min(B(n, p_n, \varepsilon), V(n, p_n, \varepsilon))$
with $\varepsilon > 0$. The term $B(n, p_n, \varepsilon)$ is a Vapnik-Chervonenkis-type bound
whereas the term $V(n, p_n, \varepsilon)$ is a Hoeffding-type term controlled by the size of the
test sample $np_n$. This bound can be interpreted as a quantitative answer to a trade-off question.
As the percentage of observations in the test sample $p_n$ increases, the
$V(n, p_n, \varepsilon)$ term decreases but the $B(n, p_n, \varepsilon)$ term increases.

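Theorem 11 (Absolute error for symmetric cross-validation) Suppose that $\mathcal{H}$ holds. Then,
we have for all $\varepsilon > 0$,

$\Pr(R_n(\Phi_n^B) - \hat{R}^{Out}_{CV} \ge \varepsilon) \le \min(B_{sym}(n, p_n, \varepsilon), V_{sym}(n, p_n, \varepsilon)) < 1$

with

• $B_{sym}(n, p_n, \varepsilon) = (2np_n + 1)^{4V_C/p_n} e^{-n\varepsilon^2}$
• $V_{sym}(n, p_n, \varepsilon) = \exp(-2np_n\varepsilon^2)$.

Remark 12 We do not require $\phi_n$ to be an empirical risk minimizer.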
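Proof.

We have $R_n(\Phi_n^B) = P\psi_n^B = P L(Y, \mathbb{E}_{V_n^{tr}}\phi_{V_n^{tr}}(X))$. Since $C$ is
a convex function (bounded itself by 1 on the support of $h(Y, \phi_{V_n^{tr}}(X))$) and $h$ is
linear in the second variable, we get

$R_n(\Phi_n^B) \le P\,C(h(Y, \mathbb{E}_{V_n^{tr}}\phi_{V_n^{tr}}(X))) \le \mathbb{E}_{V_n^{tr}} P\,C(h(Y, \phi_{V_n^{tr}}(X))).$

Then, we split according to $\mathbb{E}_{V_n^{tr}} P_{n,V_n^{ts}} C(h(Y, \phi_{V_n^{tr}}(X)))$:

$R_n(\Phi_n^B) \le \mathbb{E}_{V_n^{tr}} P_{n,V_n^{ts}} C(h(Y, \phi_{V_n^{tr}}(X))) + \mathbb{E}_{V_n^{tr}} (P - P_{n,V_n^{ts}}) C(h(Y, \phi_{V_n^{tr}}(X)))$

$= \hat{R}^{Out}_{CV} + \mathbb{E}_{V_n^{tr}} (P - P_{n,V_n^{ts}}) C(h(Y, \phi_{V_n^{tr}}(X))).$

Thus, we obtain:
$\Pr(R_n(\Phi_n^B) - \hat{R}^{Out}_{CV} \ge \varepsilon) \le \Pr(\mathbb{E}_{V_n^{tr}} (P - P_{n,V_n^{ts}}) C(h(Y, \phi_{V_n^{tr}}(X))) \ge \varepsilon)$.
To prove our result, we proceed now in two steps. For this, we bound

$\mathbb{E}_{V_n^{tr}} \big(P_{n,V_n^{ts}} C(h(Y, \phi_{V_n^{tr}}(X))) - P\,C(h(Y, \phi_{V_n^{tr}}(X)))\big)$

in two different ways:

1. using a conditional Hoeffding inequality,

2. using a Vapnik-Chervonenkis-type inequality to bound the supremum over a class.

1. First, by conditional Hoeffding arguments (for a proof, see e.g. chapter 1),

$\Pr(R_n(\Phi_n^B) - \hat{R}^{Out}_{CV} \ge \varepsilon) \le \exp(-2np_n\varepsilon^2).$

2. Secondly, we derive the bound:

$\Pr(R_n(\Phi_n^B) - \hat{R}^{Out}_{CV} \ge \varepsilon) \le \Pr(\mathbb{E}_{V_n^{tr}} (P - P_{n,V_n^{ts}}) C(h(Y, \phi_{V_n^{tr}}(X))) \ge \varepsilon)$

$\le \Pr(\mathbb{E}_{V_n^{tr}} \sup_{\phi \in \mathcal{C}} (P - P_{n,V_n^{ts}}) C(h(Y, \phi(X))) \ge \varepsilon).$

Recall a useful lemma (for the proof, see Appendices).
Lemma 13 Under the assumptions $\mathcal{H}$, we have for all $\varepsilon > 0$,

$\Pr(\mathbb{E}_{V_n^{tr}} \sup_{\phi \in \mathcal{C}} (P - P_{n,V_n^{ts}}) C(h(Y, \phi(X))) \ge \varepsilon) \le (S(2np_n, \mathcal{C}))^{4/p_n} e^{-n\varepsilon^2},$

and we also have (for the proof, see e.g. [DGL96]): $\forall n$, $S(n, \mathcal{C}) \le (n + 1)^{V_C}$.
Thus, it follows that
$\Pr(R_n(\Phi_n^B) - \hat{R}^{Out}_{CV} \ge \varepsilon) \le (2np_n + 1)^{4V_C/p_n} e^{-n\varepsilon^2}$.

Putting everything together, we get
$\Pr(R_n(\Phi_n^B) - \hat{R}^{Out}_{CV} \ge \varepsilon) \le \min(\exp(-2np_n\varepsilon^2), (2np_n + 1)^{4V_C/p_n} e^{-n\varepsilon^2})$.
□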

Theorem 14 (Absolute error for symmetric cross-validation) Suppose that H holds. Then, 
we have for all e > 0, 

Pr(# n ($f ) - R'cv >e)< mm{B sym (n,p n ,e),V svm {n 7 p n ,e)) < 1 

with 

• $B_{sym}(n, p_n, \varepsilon) = (2n(1 - p_n) + 1)^{4V_C/(1 - p_n)} e^{-n\varepsilon^2}$

• $V_{sym}(n, p_n, \varepsilon) = \exp(-2np_n\varepsilon^2)$.
Proof. 

We proceed as previously: Rn($%) = = FL(Y,E v t r ^ v tr(X)) < FC(h(Y,E v tr(f) V tr(X)) < 

E V tr¥C(h(Y,4> V tr(X). 

We then split this quantity according to EytrF nt ytrC(h(Y, <f>ytr[X) 

Rni^n) < ^VtrF n! ytrC(h(Y, <j)ytr (X) + Ey^(P - F n yts)C(h(Y, (f)yt r (X) 
= figy + Eytr (P - V n .y^)C(h(Y, (j>ytr (A)). 

Thus, we get 

Pr(5»($?) - R£v >e)< Pr(E v t r {F-F n ,yi.r)C(h(Yct>vtr(X)) > e) 
< Pr(E v tr sup(F-F ntV tr)C(h(Y,(f>(X)) > e). 

" <p£C 

Recall two useful results (for the proof, see e.g. chapter 1).
Lemma 15 Under the assumptions $\mathcal{H}$, we have for all $\varepsilon > 0$,

Pr (E^,. sup(P(0) -F n>v tr(<t>)) >e)< (5(2n(l - Pn ), C)) 4 /(i-P") e -«(i-P^ 2 . 
□

In the special case of empirical risk minimization, we can obtain a stronger result. 

Theorem 16 (Absolute error for symmetric cross-validation) Suppose that $\mathcal{H}$ holds.
Suppose also that $\phi_n$ is based on empirical risk minimization. But instead of minimizing
$R_n(\phi)$, we suppose $\phi_n$ minimizes $\frac{1}{n}\sum_{i=1}^n C(h(Y_i, \phi(X_i)))$. For
simplicity, we suppose the infimum is attained, i.e.
$\phi_n = \arg\min_{\phi \in \mathcal{C}} \frac{1}{n}\sum_{i=1}^n C(h(Y_i, \phi(X_i)))$. Then, we
have for all $\varepsilon > 0$,

$\Pr(R_n(\Phi_n^B) - \hat{R}^{Out}_{CV} \ge \varepsilon) \le \min(B_{ERM}(n, p_n, \varepsilon), V_{ERM}(n, p_n, \varepsilon)) < 1,$

with

• $B_{ERM}(n, p_n, \varepsilon) = \min\big((2np_n + 1)^{4V_C/p_n} \exp(-n\varepsilon^2),\ (2n(1 - p_n) + 1)^{4V_C/(1 - p_n)} \exp(-n\varepsilon^2/9)\big)$

• $V_{ERM}(n, p_n, \varepsilon) = \exp(-2np_n\varepsilon^2)$.



Remark 17 1. The assumption
$\phi_n = \arg\min_{\phi \in \mathcal{C}} \frac{1}{n}\sum_{i=1}^n C(h(Y_i, \phi(X_i)))$ is not so
restrictive, since in practice, in order to numerically minimize
$\frac{1}{n}\sum_{i=1}^n L(Y_i, \phi(X_i))$, one looks for $C$ convex such that for all $x, y$,
$L(y, \phi(x)) \le C(h(y, \phi(x)))$.

2. Thanks to the Hoeffding part, the bound is always smaller than 1, so it remains valid for small
samples. For bigger samples, we will prefer the Vapnik-Chervonenkis part.

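Proof.

Applying the previous result, we have
$\Pr(R_n(\Phi_n^B) - \hat{R}^{Out}_{CV} \ge \varepsilon) \le \min(\exp(-2np_n\varepsilon^2), (2np_n + 1)^{4V_C/p_n} \exp(-n\varepsilon^2))$.
Recall that
$R_n(\Phi_n^B) - \hat{R}^{Out}_{CV} \le \mathbb{E}_{V_n^{tr}} \big(P\,C(h(Y, \phi_{V_n^{tr}}(X))) - P_{n,V_n^{ts}} C(h(Y, \phi_{V_n^{tr}}(X)))\big)$.

We need the following lemma (for a proof, see chapter 1):
$\mathbb{E}_{V_n^{tr}} P_{n,V_n^{ts}} C(h(Y, \phi_{V_n^{tr}}(X))) \ge P_n C(h(Y, \phi_n(X)))$
since $\phi_n = \arg\min_{\phi \in \mathcal{C}} \frac{1}{n}\sum_{i=1}^n C(h(Y_i, \phi(X_i)))$.

Denote $\psi(Z) := C(h(Y, \phi(X)))$ with $Z := (X, Y)$. We have the following natural notation
$\psi_{V_n^{tr}}(Z) := C(h(Y, \phi_{V_n^{tr}}(X)))$.

We thus get

$\Pr(R_n(\Phi_n^B) - \hat{R}^{Out}_{CV} \ge 3\varepsilon) \le \Pr(\mathbb{E}_{V_n^{tr}} (P\psi_{V_n^{tr}} - P_{n,V_n^{ts}}\psi_{V_n^{tr}}) \ge 3\varepsilon) \le \Pr(\mathbb{E}_{V_n^{tr}} (P\psi_{V_n^{tr}} - P_n\psi_n) \ge 3\varepsilon)$

and by splitting according to $P\psi_{opt}$, we have:

$\Pr(R_n(\Phi_n^B) - \hat{R}^{Out}_{CV} \ge 3\varepsilon) \le \Pr(\mathbb{E}_{V_n^{tr}} (P\psi_{V_n^{tr}} - P_{n,V_n^{tr}}\psi_{V_n^{tr}} + P_{n,V_n^{tr}}\psi_{V_n^{tr}} - P\psi_{opt} + P\psi_{opt} - P_n\psi_n) \ge 3\varepsilon)$

$\le \Pr(\mathbb{E}_{V_n^{tr}} \sup_{\psi} (P\psi - P_{n,V_n^{tr}}\psi) \ge \varepsilon) + \Pr(\mathbb{E}_{V_n^{tr}} \sup_{\psi} (P_{n,V_n^{tr}}\psi - P\psi) \ge \varepsilon)$
$+ \Pr(\sup_{\psi} (P\psi - P_n\psi) \ge \varepsilon).$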
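Recall the following lemma (for the proof, see e.g. chapter 1),
Lemma 18 Under the assumptions of Theorem 11, we have for all $\varepsilon > 0$,

$\Pr(\mathbb{E}_{V_n^{tr}} \sup_{\psi} (P_{n,V_n^{tr}}\psi - P\psi) \ge \varepsilon) \le (S(2n(1 - p_n), \mathcal{C}))^{4/(1 - p_n)} e^{-n\varepsilon^2}$

and symmetrically

$\Pr(\mathbb{E}_{V_n^{tr}} \sup_{\psi} (P\psi - P_{n,V_n^{tr}}\psi) \ge \varepsilon) \le (S(2n(1 - p_n), \mathcal{C}))^{4/(1 - p_n)} e^{-n\varepsilon^2}.$

Then, we get

$\Pr(R_n(\Phi_n^B) - \hat{R}^{Out}_{CV} \ge 3\varepsilon) \le 2(S(2n(1 - p_n), \mathcal{C}))^{4/(1 - p_n)} e^{-n\varepsilon^2} + (S(2n, \mathcal{C}))^4 e^{-n\varepsilon^2}$

$\le 3(2n(1 - p_n) + 1)^{4V_C/(1 - p_n)} e^{-n\varepsilon^2}.$

This implies in turn that

$\Pr(R_n(\Phi_n^B) - \hat{R}^{Out}_{CV} \ge \varepsilon) \le (2n(1 - p_n) + 1)^{4V_C/(1 - p_n)} \exp(-n\varepsilon^2/9).$

Putting everything together, we get

$\Pr(R_n(\Phi_n^B) - \hat{R}^{Out}_{CV} \ge \varepsilon) \le \min\big(\exp(-2np_n\varepsilon^2),\ (2np_n + 1)^{4V_C/p_n} e^{-n\varepsilon^2},\ (2n(1 - p_n) + 1)^{4V_C/(1 - p_n)} \exp(-n\varepsilon^2/9)\big).$

□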

Theorem 19 Suppose that $\mathcal{H}$ holds. Suppose also that $n/k$ is an integer. Then, we have
for all $\varepsilon > 0$,

PrOM^f) - R%¥ >£)< mm(B k (n, Pn ,e),V k (n, Pn ,e)) 

with 

• $B_k(n, p_n, \varepsilon) = (2n/k + 1)^{4kV_C} \exp(-n\varepsilon^2)$

• Vk(n,p n ,£) = min ( exp( — 2n/ke 2 ), 2~p^ exp 



64(vAfc]n(2(2n/fc + l)) + 2) 
Proof. 

The proof starts as previously. We have

PriR'Sv ~ Rn(ipn) >s)< Pr(E v *r(¥ ntV *.^ v *r-¥^ V i t r) >e)< exp(-2np„e 2 ) 
but we also have 

Pv(R%$ - R n {i>n) > e) < Pr(E v «r(sup(P ni v«.^-P^) > e) 



i 

< 2?" exp 



ne 2 



64(^0 hi(2(2np n + l)) + 2) 



according to chapter 1. 
□ 

Following the previous results, we can obtain results for the expectation of the difference
$R_n(\Phi_n^B) - \hat{R}^{Out}_{CV}$.
Theorem 20 ($L_1$ error) Suppose that $\mathcal{H}$ holds. Suppose also that $n/k$ is an integer.
Then, we have,

E©„ (tfnW'n ) - Rev) < \/Vnp n 

Furthermore, suppose also that $\phi_n$ is based on empirical risk minimization. But instead of
minimizing $R_n(\phi)$, we suppose $\phi_n$ minimizes
$\frac{1}{n}\sum_{i=1}^n C(h(Y_i, \phi(X_i)))$. For simplicity, we suppose the infimum is attained,
i.e. $\phi_n = \arg\min_{\phi \in \mathcal{C}} \frac{1}{n}\sum_{i=1}^n C(h(Y_i, \phi(X_i)))$. Then,
we have,

\ J y n{l-p n ) 

Proof. 

We just need to apply the previous results together with the following useful lemma (for a proof,
see e.g. [DGL96]):

Lemma 21 Let $X$ be a nonnegative random variable. Let $K, C$ be nonnegative reals such that
$C \ge 1$. Suppose that for all $\varepsilon > 0$, $\Pr(X \ge \varepsilon) \le C\exp(-K\varepsilon^2)$.
Then, we have



□ 

3.2 Stability framework 

3.2.1 Introduction to stability 

To avoid the traditional analysis in the VC framework, notions of stability have been intensively
worked through since the late 90's ([KEA95], [BE01], [BE02], [KUT02], and [KUNIY02]). The object
of the stability framework is the learning algorithm rather than the space of classifiers. The
learning algorithm is a map (effective procedure) from data sets to classifiers. An algorithm is
stable at a learning set $\mathcal{D}_n$ if changing one point in $\mathcal{D}_n$ yields only a
small change in the output hypothesis. Several different notions of algorithmic stability are
described. The attraction of such an approach is that it avoids the traditional notion of
VC-dimension, and allows one to focus on a wider class of learning algorithms than empirical risk
minimization. For example, this approach provides generalization error bounds for
regularization-based learning algorithms that have been difficult to analyze within the VC
framework, such as boosting. If a map is stable, exponential bounds on the generalization error may
be obtained. As a motivation, we quote the following list of algorithms satisfying stability
properties: regularization networks, ERM, k-nearest rules, boosting.

3.2.2 Definitions and notations of stability 

The basic idea is that an algorithm is stable at a training set T> n if changing one point in T> n yields only 
a small change in the output hypothesis. Formally, a learning algorithm maps a weighted training set 
into a predictor space. Thus, stability can be translated into a Lipschitz condition for this mapping 
with high probability. 

To be more formal, following [?], we define a distance between two weighted empirical errors:

Definition 22 (Total variation) Let $P_{n,V_n}$ and $P_{n,U_n}$ be two empirical measures on
$\mathcal{Z}$ with respect to the binary vectors $V_n$ and $U_n$. We do not assume their supports to
be equal. The distance between them is defined as their total variation:

$\|P_{n,U_n} - P_{n,V_n}\| = \sup_{A \in \mathcal{P}(\mathcal{Z})} |(P_{n,U_n} - P_{n,V_n})(A)|.$
Example 23 In the case of leave-one-out (i.e. $\sum_{i=1}^n U_{n,i} = n - 1$), we have:



In the case of leave-v-out, we get:
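The total variation of Definition 22 between a subsample measure $P_{n,U_n}$ and $P_n$ can be
computed directly. The sketch below rests on two assumptions of ours: the weighted empirical measure
is normalized by $\sum_i U_{n,i}$, and the observations are pairwise distinct. Under these
assumptions the computation gives $1/n$ for leave-one-out and $v/n$ for leave-v-out. The function
names are hypothetical, and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction


def empirical_weights(data, v):
    """Weights of P_{n,V_n} on each observed point (assumed normalization by sum(v))."""
    total = sum(v)
    weights = {}
    for z, b in zip(data, v):
        weights[z] = weights.get(z, Fraction(0)) + Fraction(b, total)
    return weights


def total_variation(data, u, v):
    """sup_A |(P_{n,U_n} - P_{n,V_n})(A)| for two measures supported on the observations.

    For two probability measures on a finite set, the supremum is attained on the set
    where the first has more mass, so it equals the sum of the positive differences.
    """
    wu, wv = empirical_weights(data, u), empirical_weights(data, v)
    return sum(max(wu[z] - wv[z], Fraction(0)) for z in wu)
```

With ties among the observations the value can be smaller, which is why distinctness is assumed
here.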




At last, we need a distance $d$ on the set $\mathcal{F}$. Let us quote three important examples.
Let $\psi_1, \psi_2 \in \mathcal{F}$. The uniform distance is defined by
$d_\infty(\psi_1, \psi_2) = \sup_{Z \in \mathcal{Z}} |\psi_1(Z) - \psi_2(Z)|$, the $L_1$-distance by
$d_1(\psi_1, \psi_2) = P|\psi_1 - \psi_2|$, and the error distance by
$d_e(\psi_1, \psi_2) = |P(\psi_1 - \psi_2)|$. It is important to notice that what matters here is
not an absolute distance between the original class of predictors $\mathcal{G}$ seen as functions
but the distance with respect to the loss and/or the distribution $P$. In particular, for the
$L_1$-distance, we do not care about the behavior of the original predictors $\phi_1$ and $\phi_2$
outside the support of $P$. At last, notice that we always have $d_e \le d_1 \le d_\infty$.
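The ordering of the three distances is easy to observe on a toy example. The sketch below assumes a
distribution $P$ with finite support, given as a dictionary from points to probabilities, and takes
the supremum in $d_\infty$ over that support only; both simplifications, and the function name, are
ours.

```python
def stability_distances(psi1, psi2, dist):
    """Return (d_e, d_1, d_inf) between psi1 and psi2 for a finitely supported P.

    dist maps each point z of the support to its probability P({z}).
    """
    diffs = [(p, psi1(z) - psi2(z)) for z, p in dist.items()]
    d_inf = max(abs(d) for _, d in diffs)        # sup |psi1 - psi2|
    d_1 = sum(p * abs(d) for p, d in diffs)      # P |psi1 - psi2|
    d_e = abs(sum(p * d for p, d in diffs))      # |P (psi1 - psi2)|
    return d_e, d_1, d_inf
```

For instance, with $P$ putting mass 1/2, 1/4, 1/4 on the points 0, 1, 2, $\psi_1(z) = z$ and
$\psi_2(z) = 1$, the differences are $-1, 0, 1$: positive and negative errors cancel in $d_e$ but
not in $d_1$, and $d_\infty$ only sees the worst point.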



We are now in position to define the different notions of stability of a learning algorithm, which
cover notions introduced by [KUNIY02]. We begin with the notion of weak stability. In essence, it
says that for any given resampling vectors, the distance between two predictors is controlled with
high probability by the distance between the resampling vectors. As a motivation, notice that
algorithms such as Adaboost ([KUNIY02]) satisfy this property. With the previous notations, we have:

Definition 24 (Weak stability) Let $\mathcal{D}_n = (Z_i)_{1 \le i \le n}$ be a learning set. Let
$\lambda, (\delta_{n,p_n})_{n,p_n}$ be nonnegative real numbers. A learning algorithm $\Psi$ is
said to be weak $(\lambda, (\delta_{n,p_n})_{n,p_n}, d)$ stable if for any training vector $U_n$
whose sum is equal to $n(1 - p_n)$:

$\Pr(d(\psi_{U_n}, \psi_n) \ge \lambda \|P_{n,U_n} - P_n\|) \le \delta_{n,p_n}.$

Notice that in the former definition $\Pr$ stands for $P^{\otimes n}$. Indeed, $\psi_n$ is trained
with $n$ observations, drawn independently from $P$. A stronger notion is to consider $\psi_n$
trained with $n - 1$ observations drawn independently from $P$ and an additional general observation
$z$. We consider the stronger notion of strong stability. As a motivation, notice that algorithms
such as Empirical Risk Minimization with finite VC dimension ([KUNIY02]) satisfy this property.

Definition 25 (Strong stability) Let $z \in \mathcal{Z}$. Let
$\mathcal{D}_n = \mathcal{D}_{n-1} \cup \{z\}$ be a learning set. Let
$\lambda, (\delta_{n,p_n})_{n,p_n}$ be nonnegative real numbers. A learning algorithm $\Psi$ is
said to be strong $(\lambda, (\delta_{n,p_n})_{n,p_n}, d)$ stable if for any training vector $U_n$
whose sum is equal to $n(1 - p_n)$:

$\Pr(d(\psi_{U_n}, \psi_n) \ge \lambda \|P_{n,U_n} - P_n\|) \le \delta_{n,p_n}.$

What we have in mind for classical algorithms is
$\delta_{n,p_n} = O_n(p_n \exp(-n(1 - p_n)))$. We can state the last definition in other words. Let
$V_n^{tr}$ be a training vector with distribution $Q$ such that the number of elements in the
training set is constant and equal to $n(1 - p_n)$. Notice then that the former definition also
implies that
$\sup_{U_n \in \mathrm{support}(Q)} \Pr\big(\frac{d(\psi_{U_n}, \psi_n)}{\|P_{n,U_n} - P_n\|} \ge \lambda\big) \le \delta_{n,p_n}$,
where $\mathrm{support}(Q)$ stands for the support of $Q$. The previous notion holds for any fixed
$U_n$ in the support of $Q$. A stronger hypothesis would be that the previous probability bound
holds uniformly over $U_n$ in $\mathrm{support}(Q)$. This leads formally to the notion of
cross-validation stability. To be more accurate:

Definition 26 (Cross-validation weak stability) Let $\mathcal{D}_n = (Z_i)_{1 \le i \le n}$ be a
learning set. Let $V_n^{tr}$ be a training vector with distribution $Q$. Let
$\lambda, (\delta_{n,p_n})_{n,p_n}$ be nonnegative real numbers. A learning algorithm $\Psi$ is
said to be weak $(\lambda, (\delta_{n,p_n})_{n,p_n}, d, Q)$ stable if it is weak
$(\lambda, (\delta_{n,p_n})_{n,p_n}, d)$ stable and if:

$\Pr\Big(\sup_{U_n \in \mathrm{support}(Q)} \frac{d(\psi_{U_n}, \psi_n)}{\|P_{n,U_n} - P_n\|} \ge \lambda\Big) \le \delta_{n,p_n}.$
As before, we also define the following stronger notion:

Definition 27 (Cross-validation strong stability) Let $z \in \mathcal{Z}$. Let
$\mathcal{D}_n = \mathcal{D}_{n-1} \cup \{z\}$ be a learning set. Let $V_n^{tr}$ be a
cross-validation vector with distribution $Q$. A learning algorithm $\Psi$ is said to be strong
$(\lambda, (\delta_{n,p_n})_{n,p_n}, d, Q)$ stable if it is strong
$(\lambda, (\delta_{n,p_n})_{n,p_n}, d)$ stable and if:

Pr( SUP p-fT 

Remark 28 If the cardinal of the support of $Q$ is denoted $\kappa(n)$, then a learning algorithm
which is weak $(\lambda, (\delta_{n,p_n})_{n,p_n}, d, Q)$-stable is also strong
$(\lambda, (\kappa(n)\delta_{n,p_n})_{n,p_n}, d, Q)$-stable.

As seen in the following table, we retrieve with those notations the different notions of stability
introduced by [DEWA79], [KEA95] and also [BE01], [KUNIY02].



stability \ distance | $d_\infty$ | $d_1$ | $d_e$
Weak | weak $(\lambda, \delta)$ hypothesis stability [KUNIY02] | weak $(\lambda, \delta)$ $L_1$ stability [KUNIY02] | weak $(\lambda, \delta)$ error stability [KUNIY02]
Strong | strong $(\lambda, \delta)$ hypothesis stability [KUNIY02] [DEWA79] | strong $(\lambda, \delta)$ $L_1$ stability [KUNIY02] | strong $(\lambda, \delta)$ error stability [KUNIY02]
Sure stability | uniform stability [BE01] | [DEWA79] | error stability [KEA95]



To motivate this approach, we also list classes of predictors satisfying the previous stability
conditions.



stability \ distance | $d_\infty$ | $d_1$ | $d_e$
Weak | | | Lasso
Strong | Adaboost ([KUNIY02]) | ERM ([KUNIY02]); $k$-nearest rule | Bayesian algorithm [KEA95]
Uniform | Regularization networks | |







We recall the main notations and definitions: 



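Name | Notation | Definition
Risk or generalization error | $R_n$ | $E_P[L(Y, \phi(X, \mathcal{D}_n)) \mid \mathcal{D}_n]$
Resubstitution error | $\hat{R}_n$ | $\frac{1}{n}\sum_{i=1}^{n} L(Y_i, \phi(X_i, \mathcal{D}_n))$
Cross-validation error | $R_{CV}$ | $E_{V_n^{tr}} P_{n,V_n^{ts}} \psi_{V_n^{tr}}$

Table 1: Main notations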
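
To make the last two rows of the table concrete, here is a minimal Python sketch that computes a resubstitution error and a cross-validation error for a toy predictor. Everything in it is an assumption made for illustration: the threshold classifier, the 0-1 loss, the equal weighting of the training vectors, and the helper names (`fit_threshold`, `k_fold_training_vectors`, ...) do not come from the text. A training vector is encoded as a 0/1 list whose ones mark the observations used for training.

```python
import random

def fit_threshold(sample):
    """Toy ERM (hypothetical): threshold t minimizing the 0-1 loss of x -> int(x > t)."""
    candidates = sorted(x for x, _ in sample)
    best_t, best_mistakes = candidates[0], len(sample) + 1
    for t in candidates:
        mistakes = sum(int(x > t) != y for x, y in sample)
        if mistakes < best_mistakes:
            best_t, best_mistakes = t, mistakes
    return best_t

def zero_one_loss(t, x, y):
    """0-1 loss of the threshold predictor on one observation."""
    return int(int(x > t) != y)

def resubstitution_error(sample):
    """Average loss on the learning set of the predictor trained on that same set."""
    t = fit_threshold(sample)
    return sum(zero_one_loss(t, x, y) for x, y in sample) / len(sample)

def cross_validation_error(sample, training_vectors):
    """Average over training vectors of the test-set loss of the predictor
    trained on the observations flagged with 1 (equal weights assumed)."""
    total = 0.0
    for vector in training_vectors:
        train = [z for z, keep in zip(sample, vector) if keep == 1]
        test = [z for z, keep in zip(sample, vector) if keep == 0]
        t = fit_threshold(train)
        total += sum(zero_one_loss(t, x, y) for x, y in test) / len(test)
    return total / len(training_vectors)

def k_fold_training_vectors(n, k):
    """Training vectors of k-fold cross-validation: fold j tests on indices i with i % k == j."""
    return [[0 if i % k == fold else 1 for i in range(n)] for fold in range(k)]

# Noisy toy data: the label is int(x > 0.5), flipped with probability 0.1.
rng = random.Random(0)
xs = [rng.random() for _ in range(40)]
sample = [(x, int(x > 0.5) if rng.random() > 0.1 else int(x <= 0.5)) for x in xs]
vectors = k_fold_training_vectors(len(sample), 5)
print(resubstitution_error(sample), cross_validation_error(sample, vectors))
```

With 5 folds on 40 observations, each training vector sums to 32, which corresponds to $p_n = 1/5$ in the notation of the text.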



3.2.3 Main results 

Let $\mathcal{D}_n$ be a learning set of size $n$. Let $V_n^{tr} \sim Q$ be a training vector independent of $\mathcal{D}_n$ such that the
cross-validation is symmetric and the number of elements in the training set is constant and equal to
$n(1-p_n)$. Let $d$ be a distance among $d_e$, $d_1$, $d_\infty$. Finally, we suppose that the loss function $L$ is bounded
by 1. We derive the following general results, which hold for general cross-validation procedures and
stable algorithms.
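
Theorem 29 (Cross-validation Strong stability) Suppose that $\mathcal{H}$ holds. Let $\Psi$ be a machine learning
algorithm which is strong $(\lambda, (\delta_{n,p_n})_{n,p_n}, Q)$ stable with respect to the distance $d$. Then, for all $\varepsilon > 0$, we
have:

$\Pr(R_n(\Phi_n) - R_{CV} \ge \varepsilon) \le \exp(-2np_n\varepsilon^2)$.
Furthermore, if $d$ is the uniform distance $d_\infty$, then we have for all $\alpha > 0$:

$\Pr(R_n(\Phi_n) - R_{CV}^{Out} \ge \varepsilon) \le \min\Big(\exp(-2np_n\varepsilon^2),\ 2\Big(\exp\Big(-\frac{\varepsilon^2}{8n(8\lambda p_n + \alpha)^2}\Big) + \frac{n}{\alpha}\delta_{n,p_n}\Big)\Big)$.
Thus, if we choose $\alpha = 8\lambda p_n$,

$\Pr(R_n(\Phi_n) - R_{CV}^{Out} \ge \varepsilon) \le \min\Big(\exp(-2np_n\varepsilon^2),\ 2\Big(\exp\Big(-\frac{\varepsilon^2}{8n(16\lambda p_n)^2}\Big) + \frac{n}{8\lambda p_n}\delta_{n,p_n}\Big)\Big)$.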
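
The two competing terms of this bound can be evaluated numerically. The sketch below is only an illustration under stated assumptions: the Hoeffding-type term is $\exp(-2np_n\varepsilon^2)$ as in the first display, and the stability term follows the general shape $2(\exp(-\tau^2/(8n(c+b\alpha)^2)) + n\delta/\alpha)$ of the Kutin-type inequality recalled in the appendices, taken here with $c = 8\lambda p_n$ and $b = 1$. The function and parameter names are hypothetical.

```python
import math

def hoeffding_bound(n, p_n, eps):
    """Hoeffding-type term exp(-2 n p_n eps^2); n * p_n is the test-set size."""
    return math.exp(-2.0 * n * p_n * eps ** 2)

def kutin_type_bound(n, eps, c, delta, alpha, b=1.0):
    """Assumed form 2 * (exp(-eps^2 / (8 n (c + b alpha)^2)) + n delta / alpha)."""
    return 2.0 * (math.exp(-eps ** 2 / (8.0 * n * (c + b * alpha) ** 2)) + n * delta / alpha)

def combined_bound(n, p_n, eps, lam, delta, alpha=None):
    """Minimum of the two terms; by default alpha = 8 * lam * p_n, as chosen in the text."""
    c = 8.0 * lam * p_n
    if alpha is None:
        alpha = c
    return min(hoeffding_bound(n, p_n, eps), kutin_type_bound(n, eps, c, delta, alpha))

# Example values (arbitrary): n = 1000 observations, 20% kept for testing.
print(hoeffding_bound(1000, 0.2, 0.1))
print(combined_bound(1000, 0.2, 0.1, lam=0.001, delta=1e-9))
```

Because the Hoeffding-type term never exceeds 1, the minimum stays below 1 whatever the values of $\lambda$ and $\delta_{n,p_n}$.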

Proof. 

On the one hand, we have as before by conditional Hoeffding's inequality (for a proof, see e.g. chapter 

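$\Pr(R_n(\Phi_n) - R_{CV} \ge \varepsilon) \le \Pr\big(E_{V_n^{tr}}(P\psi_{V_n^{tr}} - P_{n,V_n^{ts}}\psi_{V_n^{tr}}) \ge \varepsilon\big) \le \exp(-2np_n\varepsilon^2)$.
On the other hand, notice that $P^{\otimes n} E_{V_n^{tr}}(P\psi_{V_n^{tr}} - P_{n,V_n^{ts}}\psi_{V_n^{tr}}) = 0$.

Denote $f(Z_1, Z_2, \dots, Z_n) := E_{V_n^{tr}}(P\psi_{V_n^{tr}} - P_{n,V_n^{ts}}\psi_{V_n^{tr}})$. Let $z \in \mathcal{Z}$. Now denote:

$B := \Big\{ \sup_{U_n \in \mathrm{support}(Q)} \frac{d(\psi_{U_n}, \psi_{n+1})}{\|P_{n,U_n} - P_{n+1}\|} \ge \lambda \Big\}$

with $\psi_{n+1}$ trained on $\mathcal{D}_{n+1} = \{Z_1, \dots, Z_i, \dots, Z_n, z\}$. Under our assumptions, we have
$\Pr(B) \le \delta_{n+1,p_{n+1}}$.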

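We want to show that with high probability there exist constants $c_i$ such that for all $i \in \{1, \dots, n\}$,
for all $z \in \mathcal{Z}$,

$|\Delta_i| := |f(Z_1, \dots, Z_i, \dots, Z_n) - f(Z_1, \dots, Z_{i-1}, z, Z_{i+1}, \dots, Z_n)| \le c_i$.

Notice that:

$|\Delta_i| = |E_{V_n^{tr}}(P\psi_{V_n^{tr}} - P_{n,V_n^{ts}}\psi_{V_n^{tr}}) - E_{V_n^{tr}}(P\psi'_{V_n^{tr}} - P'_{n,V_n^{ts}}\psi'_{V_n^{tr}})|$
$\le |E_{V_n^{tr}} P(\psi_{V_n^{tr}} - \psi'_{V_n^{tr}})| + |E_{V_n^{tr}}(P_{n,V_n^{ts}}\psi_{V_n^{tr}} - P'_{n,V_n^{ts}}\psi'_{V_n^{tr}})|$

with $P'_{n,V_n^{ts}}$ the weighted empirical measure on the sample

$\mathcal{D}'_n = \{Z_1, \dots, Z_{i-1}, z, Z_{i+1}, \dots, Z_n\}$

and $\psi'_{V_n^{tr}}$ the predictor trained on the training part of $\mathcal{D}'_n$.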

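So, first, let us bound the first term: $|E_{V_n^{tr}} P(\psi_{V_n^{tr}} - \psi'_{V_n^{tr}})| \le E_{V_n^{tr}}|P(\psi_{V_n^{tr}} - \psi_{n+1})| + E_{V_n^{tr}}|P(\psi_{n+1} - \psi'_{V_n^{tr}})|$.
Thus, on $B^c$, we have $|E_{V_n^{tr}} P(\psi_{V_n^{tr}} - \psi'_{V_n^{tr}})| \le$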

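To upper bound the second term, notice that:

$|E_{V_n^{tr}} P_{n,V_n^{ts}}\psi_{V_n^{tr}} - E_{V_n^{tr}} P'_{n,V_n^{ts}}\psi'_{V_n^{tr}}| = \big| E_{V_n^{tr}}\big(P_{n,V_n^{ts}}(\psi_{V_n^{tr}} - \psi'_{V_n^{tr}}) \mid V_i^{tr} = 1\big) \times (1-p_n)$
$\qquad + E_{V_n^{tr}}\big((P_{n,V_n^{ts}} - P'_{n,V_n^{ts}})\psi_{V_n^{tr}} \mid V_i^{ts} = 1\big) \times p_n \big|$.

We always have for any $\psi$, $|(P_{n,V_n^{ts}} - P'_{n,V_n^{ts}})\psi| \le 1/(np_n)$, thus
$|E_{V_n^{tr}}((P_{n,V_n^{ts}} - P'_{n,V_n^{ts}})\psi_{V_n^{tr}} \mid V_i^{ts} = 1) \times p_n| \le 1/n$.

We still have to bound $|E_{V_n^{tr}}(P_{n,V_n^{ts}}(\psi_{V_n^{tr}} - \psi'_{V_n^{tr}}) \mid V_i^{tr} = 1)|$, which is always smaller than
$E_{V_n^{tr}}(d_\infty(\psi_{V_n^{tr}}, \psi'_{V_n^{tr}}) \mid V_i^{tr} = 1)$ in the special case of the strongest kind of stability, namely the uniform stability.

On $B^c$, we get $d_\infty(\psi_{V_n^{tr}}, \psi'_{V_n^{tr}}) \le d_\infty(\psi_{V_n^{tr}}, \psi_{n+1}) + d_\infty(\psi_{n+1}, \psi'_{V_n^{tr}}) \le 4\lambda p_n$.
Thus, on $B^c$, we derive

$E_{V_n^{tr}}(d_\infty(\psi_{V_n^{tr}}, \psi'_{V_n^{tr}}) \mid V_i^{tr} = 1) \le 4\lambda p_n$.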
Putting all together, with probability at least $1 - \delta_{n,p_n}$, we get

4A 

sup \f{Z x ,...,Z i ,...,Z n )-f{Z x ,...,z,...,Z n )\ < —r+4,Xp n (l- Pn )<8X Pn . 

l<i<n,zeZ n + 1 

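Applying theorem ??, we obtain that for all $\varepsilon > 0$:

$\Pr\big(E_{V_n^{tr}}(P\psi_{V_n^{tr}} - P_{n,V_n^{ts}}\psi_{V_n^{tr}}) \ge \varepsilon\big) \le 2\Big(\exp\Big(-\frac{\varepsilon^2}{8n(8\lambda p_n + \alpha)^2}\Big) + \frac{n}{\alpha}\delta_{n,p_n}\Big)$

$\qquad \le 2\Big(\exp\Big(-\frac{\varepsilon^2}{8n(16\lambda p_n)^2}\Big) + \frac{n}{8\lambda p_n}\delta_{n,p_n}\Big)$ by taking $\alpha = 8\lambda p_n$.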



□ 



Theorem 30 (Cross-validation Weak stability) Suppose that $\mathcal{H}$ holds. Let $\Psi$ be a machine learning
algorithm which is weak $(\lambda, (\delta_{n,p_n})_{n,p_n}, Q)$ stable with respect to the distance $d$. Then, for all $\varepsilon > 0$, we have

$\Pr(R_n(\Phi_n) - R_{CV} \ge \varepsilon) \le \exp(-2np_n\varepsilon^2)$.
Furthermore, if the distance is the uniform distance $d_\infty$, we have for all $\varepsilon > 0$:


i(5 



1/2 



Pr(i?„(^) - flg# > e) < min(cxp(-2np„ £ 2 ), 2(exp(- w(9 Z Pn) * + exp( 4(9 /^ )2 )) + 

TlSn{p n )- 

Proof. 

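Denote $f(Z_1, Z_2, \dots, Z_n) := R_{CV}^{Out} - R_n$ and $B := \big\{ \sup_{U_n \in \mathrm{support}(Q)} \frac{d(\psi_{U_n}, \psi_{n+1})}{\|P_{n,U_n} - P_{n+1}\|} \ge \lambda \big\}$ with $\psi_{n+1}$
trained on $\mathcal{D}_{n+1} = \{Z_1, \dots, Z_i, Z_{i+1}, \dots, Z_n, Z'_i\}$.

We want to show that for all $i$, there exists a constant $c_i$ such that $|\Delta_i| := |f(Z_1, \dots, Z_i, \dots, Z_n) - f(Z_1, \dots, Z'_i, \dots, Z_n)| \le c_i$
with high probability, where $Z_1, \dots, Z_i, \dots, Z_n, Z'_i$ are i.i.d. variables.

$|\Delta_i| = |E_{V_n^{tr}}(P\psi_{V_n^{tr}} - P_{n,V_n^{ts}}\psi_{V_n^{tr}}) - E_{V_n^{tr}}(P\psi'_{V_n^{tr}} - P'_{n,V_n^{ts}}\psi'_{V_n^{tr}})|$
$\le E_{V_n^{tr}}|P(\psi_{V_n^{tr}} - \psi'_{V_n^{tr}})| + E_{V_n^{tr}}|P_{n,V_n^{ts}}\psi_{V_n^{tr}} - P'_{n,V_n^{ts}}\psi'_{V_n^{tr}}|$

with $P'_n$, $P'_{n,V_n^{ts}}$ the weighted empirical measures of the sample $\mathcal{D}'_n = \{Z_1, \dots, Z'_i, \dots, Z_n\}$ and $\psi'_n$ the
predictor built on $\mathcal{D}'_n$.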
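
So, first, let us bound the first term: $E_{V_n^{tr}}|P(\psi_{V_n^{tr}} - \psi'_{V_n^{tr}})| \le E_{V_n^{tr}}|P(\psi_{V_n^{tr}} - \psi_{n+1})| + E_{V_n^{tr}}|P(\psi_{n+1} - \psi'_{V_n^{tr}})|$.
Thus, on $B^c$, we have $|E_{V_n^{tr}} P(\psi_{V_n^{tr}} - \psi'_{V_n^{tr}})| \le$
To upper bound the second term, notice that:

$|E_{V_n^{tr}} P_{n,V_n^{ts}}\psi_{V_n^{tr}} - E_{V_n^{tr}} P'_{n,V_n^{ts}}\psi'_{V_n^{tr}}| = \big| E_{V_n^{tr}}\big(P_{n,V_n^{ts}}(\psi_{V_n^{tr}} - \psi'_{V_n^{tr}}),\ V_i^{tr} = 1\big) \times (1-p_n)$
$\qquad + E_{V_n^{tr}}\big((P_{n,V_n^{ts}} - P'_{n,V_n^{ts}})\psi'_{V_n^{tr}},\ V_i^{ts} = 1\big) \times p_n \big|$.
We always have for all $\psi$, $|(P_{n,V_n^{ts}} - P'_{n,V_n^{ts}})\psi| \le 1/(np_n)$, thus we get

$|E_{V_n^{tr}}((P_{n,V_n^{ts}} - P'_{n,V_n^{ts}})\psi_{V_n^{tr}},\ V_i^{ts} = 1) \times p_n| \le 1/n$.

We still have to bound $|E_{V_n^{tr}}(P_{n,V_n^{ts}}(\psi_{V_n^{tr}} - \psi'_{V_n^{tr}}),\ V_i^{tr} = 1)| \le E_{V_n^{tr}}(d_\infty(\psi_{V_n^{tr}}, \psi'_{V_n^{tr}}),\ V_i^{tr} = 1)$ in the
special case of the uniform stability.

On $B^c$, we derive $d_\infty(\psi_{V_n^{tr}}, \psi'_{V_n^{tr}}) \le d_\infty(\psi_{V_n^{tr}}, \psi_{n+1}) + d_\infty(\psi_{n+1}, \psi'_{V_n^{tr}}) \le 4\lambda p_n$, thus on $B^c$

$E_{V_n^{tr}}(d_\infty(\psi_{V_n^{tr}}, \psi'_{V_n^{tr}}),\ V_i^{tr} = 1) \le 4\lambda p_n$.

Putting all together, with probability at least $1 - \delta_{n,p_n}$,

$|f(Z_1, \dots, Z_i, \dots, Z_n) - f(Z_1, \dots, Z'_i, \dots, Z_n)| \le 8\lambda p_n$.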

□ 

Following the previous results, we can obtain results for the expectation of the difference $R_n(\Phi_n) - R_{CV}^{Out}$.

Theorem 31 In the case of classification, we can bound the excess risk by

$E_{\mathcal{D}_n}(R_n(\Phi_n) - R_{CV}) \le \sqrt{1/(np_n)}$.
Furthermore, if $d$ is the uniform distance $d_\infty$, then we have for all $\alpha > 0$:

E Vn {R n {^ ) - R%#) < min( v /l/np„, v / 16^Ap„ + -r^—S n , Pn ) 

4Ap« 

Similar results can be derived in the context of the weak stability. 
Proof 

It is sufficient to apply the previous probability upper bounds together with the lemma PHI 

□ 

4 Results for the cross-validated subagged classification 

In the case of subagging of classifiers (i.e. the majority vote), we can obtain the following results:

Theorem 32 For any subagged classifier, we can bound the excess risk:

Pr(i?„($f ) - l -R° c ^ >e)< exp(-8n Pn e 2 /9) 

and also 

Pr(P„($f) - >e)< Zexp(-2np n e 2 /9) 
where $N$ denotes the total number of training vectors in the cross-validation and $l$ denotes $\lfloor (N-1)/2 \rfloor + 1$,
that is the strict majority of the subagged classifiers, and $R_{CV}^{(l)}$ the cross-validated estimate of this
majority.

Furthermore, in the particular case of binary classification we also have 

Pr(i?„($f ) - (i?g y >/2 - 1/2)) < -e) < cxp(-2np„e 2 /9) 

and 

Pr(R n ($g) - (lR^ j - I + 1) < s) < I cxp(-2np„e 2 ) 

Proof. 

We consider an i.i.d. ghost sample of size $m$: $(X_1, Y_1), \dots, (X_m, Y_m)$. Denote $\eta_i := L(Y_i, \Phi_n^B(X_i))$.
Then $e_m^B := \frac{1}{m}\sum_{i=1}^{m} \eta_i$ corresponds to the average number of mistakes of $\Phi_n^B$ on the ghost sample. In

the same way, $e_m^{v} := \frac{1}{m}\sum_{i=1}^{m} L(Y_i, \phi_{v^{tr}}(X_i))$ (respectively $e_m^a := E_{V_n^{tr}} \frac{1}{m}\sum_{i=1}^{m} L(Y_i, \phi_{V_n^{tr}}(X_i))$) is
the average number of mistakes of $\phi_{v^{tr}}$ (respectively the weighted average number of mistakes of
the family of predictors $\phi_{V_n^{tr}}$).



Denote by 




1. L X := Rn{$g) - IkZf 




2. L 2 := R n (<Pg) - eg 




3. L 3 := eg - e° m /2 




4. U := |K - Ex^ytrLiY^vtr 


{X))\ 


5. i 5 := |[£7 x ,yE Vt jrL(Y;^rjr(X)) 


T)Out 



We have 

$\Pr(L_1 \ge 3\varepsilon) \le \Pr(L_2 \ge \varepsilon) + \Pr(L_3 > 0) + \Pr(L_4 \ge \varepsilon) + \Pr(L_5 \ge \varepsilon)$.
By Hoeffding's inequality, we have:

$\Pr(L_2 \ge \varepsilon) \le \exp(-2m\varepsilon^2)$
and also $\Pr(L_4 \ge \varepsilon) \le \exp(-2m(2\varepsilon)^2)$.

By conditional Hoeffding's inequality (for a proof, see e.g. [?]), we deduce

$\Pr(L_5 \ge \varepsilon) \le \exp(-2np_n(2\varepsilon)^2)$.

By conditional Hoeffding's inequality, we also have

$\Pr\big(e_m^a - E_{X,Y}E_{V_n^{tr}} L(Y, \phi_{V_n^{tr}}(X)) \ge \varepsilon\big) \le \exp(-2m\varepsilon^2)$,
since for fixed $v^{tr}$, $\Pr\big(\frac{1}{m}\sum_{i=1}^{m} L(Y_i, \phi_{v^{tr}}(X_i)) - E_{X,Y}L(Y, \phi_{v^{tr}}(X)) \ge \varepsilon\big) \le \exp(-2m\varepsilon^2)$.

We suppose here that the probabilities $\Pr(V_n^{tr} = v_n)$ are rational numbers whose least common denominator is denoted by
$N$. Thus $e_m^a$ can be seen as a simple average number of mistakes of a family of predictors $(\phi_j)_{1 \le j \le N}$
on the ghost sample.

First notice that if $e_m^a$ is small then $e_m^B$ must be small too. Indeed, we have

$N m\, e_m^a = \sum_{j=1}^{N}\sum_{i=1}^{m} e_{i,j} = \sum_{1 \le j \le N,\ 1 \le i \le m} e_{i,j}$

with $e_{i,j} := L(Y_i, \phi_j(X_i)) \in \{0, 1\}$. We thus deduce that the total number of mistakes on the ghost
sample of the family of predictors $(\phi_j)_{1 \le j \le N}$ is equal to $N m\, e_m^a$. Notice that if the number of mistakes
of the family $(\phi_j)_{1 \le j \le N}$ on the $i$-th observation is at most $\lfloor (N-1)/2 \rfloor$ (i.e. $\sum_{j=1}^{N} e_{i,j} \le \lfloor (N-1)/2 \rfloor$),
then a strict majority of predictors have classified $Y_i$ correctly, which in turn tells us
that a strict majority of predictors have the same output $Y_i = \phi_j(X_i)$. We thus have $\Phi_n^B(X_i) = Y_i$,
which implies $\eta_i = L(Y_i, \Phi_n^B(X_i)) = 0$.

Denoting by $\kappa = m\, e_m^B$ the number of mistakes of the subagged classifier on the ghost sample, we
necessarily have

$\sum_{i=1}^{m}\sum_{j=1}^{N} e_{i,j} \ge \kappa(\lfloor (N-1)/2 \rfloor + 1) = \kappa \lfloor (N+1)/2 \rfloor$.

It follows that 

P B < P a < P a /o 

m ~ L(JV + 1)/2J m m/ ' 

Thus $\Pr(L_3 > 0) = 0$.
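
The counting argument can be checked on simulated data. The sketch below is an illustration, not part of the proof: it draws a random 0/1 mistake matrix (rows are ghost observations, columns are predictors), assumes binary labels with ties counted as mistakes of the vote (the worst case), and verifies that the total number of individual mistakes is at least the number of vote mistakes times $\lfloor (N+1)/2 \rfloor$. Arithmetically this implies that the error of the vote is at most twice the average error of the family. The function names and the simulation settings are assumptions.

```python
import random

def subagging_mistake_counts(mistakes):
    """mistakes[i][j] = 1 if predictor j errs on ghost observation i.

    Returns (number of mistakes of the majority vote, total number of
    individual mistakes, strict majority floor((N-1)/2) + 1)."""
    n_predictors = len(mistakes[0])
    strict_majority = (n_predictors + 1) // 2  # equals floor((N-1)/2) + 1
    # Worst case: the vote is counted as wrong as soon as a strict majority
    # of correct predictors is not guaranteed.
    vote_mistakes = sum(1 for row in mistakes if sum(row) >= strict_majority)
    total_mistakes = sum(sum(row) for row in mistakes)
    return vote_mistakes, total_mistakes, strict_majority

def simulate_mistakes(m, n_predictors, error_rate, seed):
    """Independent Bernoulli mistakes (a simplifying assumption)."""
    rng = random.Random(seed)
    return [[int(rng.random() < error_rate) for _ in range(n_predictors)]
            for _ in range(m)]

m, n_predictors = 200, 7
matrix = simulate_mistakes(m, n_predictors, error_rate=0.3, seed=1)
kappa, total, majority = subagging_mistake_counts(matrix)
vote_error = kappa / m
average_error = total / (m * n_predictors)
print(vote_error, average_error, kappa * majority <= total)
```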

We conclude Pr(R n ($%) - \R°tf > 3e) < exp(-2np„(2e) 2 ) + exp(-2m(2£) 2 ) + exp(-2me 2 ). 
If we let m — > oo, 

Pr(/?„($ I f ) - Iflg"? > e) < cxp(-8np„e 2 /9) 

Notice that in the particular case of the binary classification, we have by symmetry, 1 — < 
L(Af+i)/2j i 1 - which gives 

e — (1 ) < e 

[AT/2 + 1J ' m 1 + 1J m 

and eventually > [jv/ ^ +1J C - 1/2 > ef n - 1/2 

Thus, for binary classification, we can even obtain an probability upper bound for Pr(|i?„($^) — 
3 -Rev I > e) not only for Pr(fi„($f ) - ±i?g y * > e). Indeed, denote by 

1. L[ := R n {^) jwrkr^cv ~ 1/2) 

2. ^ := - e£ 

3. i 3 := - ( [n/2+i\ e m - I/ 2 ) 
N „a 1 /oi i N 



4 - ^ := (pvTt+IjC - 1/2) - ( w % TTJ E X!Y E v trL(Y,ci> v tr(X)) - 1/2) 



5. L 5 



{jWj^Ex.y^LiY^v^X)) - 1/2) - {^j^ROut _ 1/2) 



We get 
$\Pr(L_1 \le -3\varepsilon) \le \Pr(L_2 \le -\varepsilon) + \Pr(L_3 < 0) + \Pr(L_4 \le -\varepsilon) + \Pr(L_5 \le -\varepsilon)$

< exp(-2m £ 2 ) + + cxp(-2m( ^ + - ef) + cxp(-2np„( ^ + - ef) 

Taking m -> oo, and noticing that N/[N/2 + lj > 1 

Pr(i? n (^) - (i?g^/2 - 1/2)) < - £ ) < Pr(i? n (^) - (i?g^/2 - 1/2) < - £ ) 

< Pr(ii < -e) < cxp(-2np n e 2 /9) 

For binary classification, we can eventually obtain that 

Pr(|i?„($ r f ) - i(i?g^ - 1/2)| > e) < exp(-8np„e 2 /9) +cxp(-2np„e 2 /9) < 2 exp(-2np„e 2 /9) 

Denote by $e_j := \frac{1}{m}\sum_{i=1}^{m} e_{i,j}$ the average number of mistakes made by predictor $j$ on the ghost sample.
We can sort them in increasing order: $e_{(1)}, \dots, e_{(N)}$. Let $l := \lfloor N/2 + 1 \rfloor$ be the strict majority. An
interesting case is when we know that a strict majority of classifiers are very good. Denote by

$e_G := \frac{1}{l}\sum_{j=1}^{l} e_{(j)}$

the global average error of the $l$ best classifiers on the ghost sample.



In the same way, denote by fij := Ex,YL(Y,(f>j(X)) the risk of the j-th classifier. We introduce 
now a cross-validation estimate of the average risk j X^=i A*(j) °f the I best classifiers: R^y ■ For 
this, recall that each 4>j corresponds to some <p v tr thus we can define an out sample error for the 

predictor j : fj := W n ,v*f (L{Y, (f)j(X)). And we define R^y := j Y^j=i 



1. i?l 


:=R n {^)-lR^ 


2. i? 2 


:= R n ($*) - eg 


3. i?3 


— e B - le G 


4. i? 4 




5. i?5 


: = Ej-=i mo) -Rev) 


We have 





$\Pr(R_1 \ge 3\varepsilon) \le \Pr(R_2 \ge \varepsilon) + \Pr(R_3 > 0) + \Pr(R_4 \ge \varepsilon) + \Pr(R_5 \ge \varepsilon)$.
By Hoeffding's inequality, we have:

$\Pr(R_2 \ge \varepsilon) \le \exp(-2m\varepsilon^2)$.

We also derive 

Pr(i? 4 >e) = Pr(e£ - \ £< =1 > e/i) = Pr(£ < =1 e (j) - £* =1 > e) 






There exist permutations a and a such that = e a ^ and (j,^ = Mcr'Q)- Thus, we get 

i 

Pr(R 4 >e)<Pr(J2e aU) -^ {j) >e) 

3=1 

I 

^ Pr (Ev(i)-V (i ) >e) 
by definition of e^-). It follows that 

l 

Pr(R 4 >e)<J2^K'U)-^'(j)>^ 

3 = 1 

< I exp(— 2me 2 ). 
In the same way, we deduce $\Pr(R_5 \ge \varepsilon) \le l \exp(-2np_n\varepsilon^2)$.

By conditional Hoeffding's inequality (for a proof, see e.g. [?]), we deduce $\Pr(L_5 \ge \varepsilon) \le \exp(-2np_n(2\varepsilon)^2)$
and also for a fixed $v^{tr}$

$\Pr\big(|e_m^{v} - E_{X,Y}L(Y, \phi_{v^{tr}}(X))| \ge \varepsilon\big) \le 2\exp(-2m\varepsilon^2)$.
By conditional Hoeffding's inequality (for a proof, see e.g. [?]), we also have

$\Pr\big(|e_m^a - E_{X,Y}E_{V_n^{tr}}L(Y, \phi_{V_n^{tr}}(X))| \ge \varepsilon\big) \le 2\exp(-2m\varepsilon^2)$.

Notice that if all the $l$ best classifiers classify the $i$-th observation correctly (i.e. $e_{i,(j)} = 0$ for all
$j \in \{1, \dots, l\}$), then the subagged classifier also classifies it correctly. Thus $\eta_i = 0$. Let $\kappa$ be
the number of mistakes of the subagged classifier on the ghost sample and let $x$ be the number of
observations correctly classified by all the $l$ classifiers. Then the number of observations correctly
classified by the subagging is at least $x$, i.e. $m - \kappa \ge x$. On the other hand, there is
at least one predictor that makes a mistake on each of the remaining $m - x$ observations. Thus $m - x$
is at most the total number of mistakes made by the $l$ best classifiers:

$(m - x) \le m\, l\, e_G$.

From which, it follows that 



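Thus $\Pr(R_3 > 0) = 0$.
Putting everything together, we have

$\Pr(R_n(\Phi_n^B) - l R_{CV}^{(l)} \ge 3\varepsilon) \le \exp(-2m\varepsilon^2) + l\exp(-2m\varepsilon^2) + l\exp(-2np_n\varepsilon^2)$.
If we let $m \to \infty$, $\Pr(R_n(\Phi_n^B) - l R_{CV}^{(l)} \ge \varepsilon) \le l\exp(-2np_n\varepsilon^2/9)$.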

Once again, in the particular case of binary classification, we have by symmetry $1 - e_m^B \le l(1 - e_G)$,
which leads to

$e_m^B \ge 1 - l(1 - e_G)$.
In the same way, we have a symmetrical result for binary classification:

$\Pr(R_n(\Phi_n^B) - (l R_{CV}^{(l)} - l + 1) \le -3\varepsilon) \le \exp(-2m\varepsilon^2) + l\exp(-2m\varepsilon^2) + l\exp(-2np_n\varepsilon^2)$,

which is at most $l\exp(-2np_n\varepsilon^2)$ as $m \to \infty$, and
which gives $\Pr(|R_n(\Phi_n^B) - (l R_{CV}^{(l)} - l + 1)| \ge \varepsilon) \le 2l\exp(-2np_n\varepsilon^2/9)$.

□ 

In the case of subagging of classifiers (i.e. the majority vote) whose VC dimension is finite, we can
obtain a stronger result:

Theorem 33 Suppose $\mathcal{H}$ holds and that the learning algorithm is based on empirical risk minimization.
We can bound the excess risk:

Pr(i?„($f) - l£g# > e) < min(exp(-8np„ £ 2 /9), (2n(l-p») + ifVc/{i-v n ) e -An{i-p n )^y 
and also 

Pv(R n ($%) - IR%? >e)< lexp(-2np n e 2 /9) 

with $l := \lfloor (N-1)/2 \rfloor + 1$ the strict majority of the subagged classifiers and $R_{CV}^{(l)}$ the cross-validated
estimate of this majority.

Furthermore, in the particular case of binary classification we also have 

Pr{R n ($Z)-{R°f/2-l/2)) < -e) < min(exp(-2np„ £ 2 /9), (2n(l -p n ) + rfVc / {i- Pn ) e -±n{i- Pn )e* ) 
and 

Pr(i?„($f ) - [IR™$ - I + 1) < -e) < /cxp(-2np„e 2 ) 

Proof. 

We use again the lemma (for a proof, see chapter 1): R^y > F n L(Y, 4> n {X)) since 

1 " 

4>n = argmiri - Y] L(Y f , </>(Xj)). 
4>eC n 

i—l 

Following the last proof, we can bound L 5 in another way. 

Pr(L 5 > 3e) < Pv(E v t r [E XtY L(Y, 4> V ^{X)) - V n L{Y^ n {X))] > 6e) 
< Pr(E y ^[^ y L(r,^(X)) - F n L(XMX))] > 6e) 

Then as in proof, we split according to PL(Y, op t(X)) and we obtain by lemma I2T1 
Pr(i 5 > e) < (2n(l- Pn ) + i)4^/(i-Pn) e -n(i- P „)(2e) 2 

□ 

5 Results for the subagged predictor selection

The remaining important question is: in practice, how should we choose $p_n$? We give a hint for this
question.

First, suppose that the final user wants to have an accuracy equal to a certain level $\eta$.

Then we need to provide a rule to choose an optimal $p_n^*$ and to upper bound the probability of
excess risk $\Pr(R_n(\phi_n^{p_n^*}) - R_{CV}^{Out}(p_n^*) \ge \eta)$. Previous bounds tell us that for any fixed $p_n$, $\Pr(R_n(\phi_n^{p_n}) - R_{CV}^{Out}(p_n) \ge \varepsilon) \le \min(B(n, p_n, \varepsilon), V(n, p_n, \varepsilon))$.
Notice that $\min(B(n, p_n, \varepsilon), V(n, p_n, \varepsilon))$ seen as a
function of $\varepsilon$ is a continuous non-increasing function. Thus, we can define an inverse denoted by $f$.

The previous probability bound becomes for any $p_n$: $\Pr(R_n(\phi_n^{p_n}) - R_{CV}^{Out}(p_n) \ge f(n, p_n, \delta)) \le \delta$.
For each $k$, define $\delta_{n,k}$ by $f(n, k/n, \delta_{n,k}) = \eta$, i.e. $\delta_{n,k} = \min(B(n, k/n, \eta), V(n, k/n, \eta))$. Denote
$k_n^* := \arg\min_{k \in \{1, \dots, n-1\}} R_{CV}^{Out}(k/n) + f(n, k/n, \delta_{n,k})$ and denote by $p_n^* := k_n^*/n$. Thus, we obtain:
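
The selection rule can be sketched in a few lines of Python. This is a simplified illustration under explicit assumptions: only the Hoeffding-type branch of the bound is inverted, which gives $f(n, p_n, \delta) = \sqrt{\ln(1/\delta)/(2np_n)}$ from $\exp(-2np_n\varepsilon^2) = \delta$, and a single fixed confidence level `delta` is used for every $k$ instead of the levels $\delta_{n,k}$ of the text. The dictionary `cv_error` of cross-validation estimates and both function names are hypothetical.

```python
import math

def hoeffding_radius(n, p_n, delta):
    """Inverse of eps -> exp(-2 n p_n eps^2): the eps at which the bound equals delta."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n * p_n))

def select_test_fraction(cv_error, n, delta):
    """cv_error maps k to the cross-validation estimate obtained with p_n = k / n.

    Returns (k_star, p_star) minimizing the penalized estimate
    cv_error[k] + hoeffding_radius(n, k / n, delta)."""
    k_star = min(cv_error, key=lambda k: cv_error[k] + hoeffding_radius(n, k / n, delta))
    return k_star, k_star / n

# Made-up estimates for three candidate test-set sizes out of n = 100.
cv_error = {10: 0.30, 50: 0.20, 90: 0.35}
print(select_test_fraction(cv_error, n=100, delta=0.05))
```

The penalty shrinks as the test fraction grows, while the cross-validation estimate typically degrades when too few observations remain for training; the rule trades these two effects against each other.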

Theorem 34 (Subagging selection) Suppose that $\mathcal{H}$ holds. Suppose also that $\phi_n$ is based on empirical
risk minimization. But instead of minimizing $R_n(\phi)$, we suppose $\phi_n$ minimizes $\frac{1}{n}\sum_{i=1}^{n} C(h(Y_i, \phi(X_i)))$.
For simplicity, we suppose the infimum is attained, i.e. $\phi_n = \arg\min_{\phi \in \mathcal{C}} \frac{1}{n}\sum_{i=1}^{n} C(h(Y_i, \phi(X_i)))$. In
this context, we have:

• if $\delta > \delta_n$,

$f(n, p_n, \delta) = \sqrt{\frac{\ln(1/\delta)}{2np_n}}$

and if $\delta < \delta_n$,

$f(n, p_n, \delta) = 3\sqrt{\frac{4V_C \ln(2n(1-p_n)+1)/(1-p_n) + \ln(1/\delta)}{n}}$

with 5 n := (2n(l - p n ) + 1) d-p^nuS^) . 
Furthermore, we have for all e > 0: 



Pr(i?„(<^") - R%$(Pn) > e) = On((n + lf Vc exp 



1 - exp(-2e 2 ) 



Proof 

We have: 



Pv(R n (<f>*' p ") - RSviPn) >V)= Pr(i?„(0f p; ) - R%&(p*) > /(n,p*A,fc*)) 

< Yl ^(Rn(<t>n' Pk )>Rcv(Pk) + f(n,k/n,S n , k )). 

fc£{l...n-l} 

It follows that: 

Pr(i^*»)-.^(p*)>,7)< Yl P« W )-^'W>1) 

fee{l...n-l} 

< min(S(n, k/n, rf), V(n, k/n, r])). 

fee{l...n-l} 

Thus, using previous bounds we get: 

$\Pr(R_n(\phi_n^{p_n^*}) - R_{CV}^{Out}(p_n^*) \ge \eta) \le \min_{k_0 \in \{1, \dots, n-1\}} \Big( \sum_{k=1}^{k_0-1} (2n(1-k/n)+1)^{4V_C/(1-k/n)} \exp(-2n\eta^2) + \sum_{k=k_0}^{n-1} \exp(-2k\eta^2) \Big)$

$\le \min_{k_0 \in \{1, \dots, n-1\}} \Big( k_0 (2n+1)^{4V_C/(1-k_0/n)} \exp(-2n\eta^2) + \frac{\exp(-2k_0\eta^2)}{1 - \exp(-2\eta^2)} \Big)$

$\le \min_{k_0 \in \{1, \dots, n-1\}} \Big( (2n+1)^{4V_C/(1-k_0/n)} a^n + \frac{a^{k_0}}{1-a} \Big)$ with $a := \exp(-2\eta^2)$.

We look for $k_0$ in $\{(1 - z_n)n,\ 0 < z_n < 1 \text{ and } z_n \to 0 \text{ as } n \to \infty\}$:

$\Pr(R_n(\phi_n^{p_n^*}) - R_{CV}^{Out}(p_n^*) \ge \eta) \le \min\Big( (2n+1)^{4V_C/z_n} a^n + \frac{a^{(1-z_n)n}}{1-a} \Big)$.
We look for $z_n$ such that $(2n+1)^{4V_C/z_n} a^n \to 0$ as $n \to \infty$.

Let us even find $z_n$ such that $(2n+1)^{4V_C/z_n} = \frac{a^{-n z_n}}{1-a}$. It is thus equivalent to: $-n\ln(a) z_n^2 - \ln(1-a) z_n - 4V_C \ln(2n+1) = 0$.

We have $\Delta = \ln(1-a)^2 - 16 V_C \ln(2n+1)\, n \ln(a) > 0$ since $|a| < 1$.

Since $0 < z_n < 1$, $z_n$ is necessarily the nonnegative root of the previous equation, which leads
to:

$z_n = \frac{\ln(1-a) + \sqrt{\ln(1-a)^2 - 16 V_C \ln(2n+1)\, n \ln(a)}}{-2n\ln(a)}$

4V^ 2 ,[^n) 



In(l/a)V2V n 



2\/2V^ 2 /ln(n) 



77 V n 
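
As a numerical sanity check of the root computed above, the following Python sketch solves the quadratic $-n\ln(a) z^2 - \ln(1-a) z - 4V_C\ln(2n+1) = 0$ and verifies that, at the positive root, the logarithms of the two terms $(2n+1)^{4V_C/z} a^n$ and $a^{(1-z)n}/(1-a)$ coincide. The function names and the numerical values used in the example are assumptions chosen for illustration only.

```python
import math

def balancing_z(n, vc_dim, eta):
    """Positive root of -n ln(a) z^2 - ln(1 - a) z - 4 V_C ln(2n + 1) = 0, a = exp(-2 eta^2)."""
    a = math.exp(-2.0 * eta ** 2)
    coef_a = -n * math.log(a)                       # positive since ln(a) < 0
    coef_b = -math.log(1.0 - a)                     # positive since 0 < 1 - a < 1
    coef_c = -4.0 * vc_dim * math.log(2 * n + 1)    # negative
    discriminant = coef_b ** 2 - 4.0 * coef_a * coef_c
    return (-coef_b + math.sqrt(discriminant)) / (2.0 * coef_a)

def log_terms(n, vc_dim, eta, z):
    """Logarithms of (2n+1)^(4 V_C / z) a^n and of a^((1-z) n) / (1 - a)."""
    a = math.exp(-2.0 * eta ** 2)
    log_vc_term = (4.0 * vc_dim / z) * math.log(2 * n + 1) + n * math.log(a)
    log_geometric_term = n * (1.0 - z) * math.log(a) - math.log(1.0 - a)
    return log_vc_term, log_geometric_term

z = balancing_z(n=1000, vc_dim=3, eta=0.3)
print(z, log_terms(1000, 3, 0.3, z))
```

Working with logarithms avoids overflow, since $(2n+1)^{4V_C/z}$ alone can be astronomically large even when its product with $a^n$ is small.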



We can inject z„ in (2n + l) 4Vc / z "a™ + ° and we find that 

Pr(i?„(4' P ") - Rgtffa) >r,) = O n ((n + lf Vc cxp(-2n(r / - 2^ c 1/2 VTn7^) 2 )/(l - exp(-2^ 2 )) 
Let us now find the expression of $f$, the inverse of $\min(B(n, p_n, \varepsilon), V(n, p_n, \varepsilon))$ with

• $B(n, p_n, \varepsilon) = (2n(1-p_n)+1)^{4V_C/(1-p_n)} \exp(-n\varepsilon^2/9)$

• $V(n, p_n, \varepsilon) = \exp(-2np_n\varepsilon^2)$.

In the case of the ERM algorithm,

$\exp(-2np_n\varepsilon^2) \le (2n(1-p_n)+1)^{4V_C/(1-p_n)} \exp(-n\varepsilon^2/9)$
if and only if $-2np_n\varepsilon^2 \le \frac{4V_C}{1-p_n}\ln(2n(1-p_n)+1) - n\varepsilon^2/9$, which is equivalent to

$n(1/9 - 2p_n)\varepsilon^2 \le \frac{4V_C \ln(2n(1-p_n)+1)}{1-p_n}$

and also $\varepsilon \le \sqrt{\frac{4V_C \ln(2n(1-p_n)+1)}{n(1-p_n)(1/9 - 2p_n)}} =: \varepsilon_n$.

Thus if $\varepsilon < \varepsilon_n$, it follows that $\min(B(n, p_n, \varepsilon), V(n, p_n, \varepsilon)) = \exp(-2np_n\varepsilon^2)$; thus if $\delta = \exp(-2np_n\varepsilon^2)$,
we deduce that $\varepsilon = \sqrt{\frac{\ln(1/\delta)}{2np_n}}$. If $\varepsilon > \varepsilon_n$, $\min(B(n, p_n, \varepsilon), V(n, p_n, \varepsilon)) = (2n(1-p_n)+1)^{4V_C/(1-p_n)} \exp(-n\varepsilon^2/9)$.

Thus if $\delta = (2n(1-p_n)+1)^{4V_C/(1-p_n)} \exp(-n\varepsilon^2/9)$, we then deduce that $\varepsilon = 3\sqrt{\frac{4V_C \ln(2n(1-p_n)+1)/(1-p_n) + \ln(1/\delta)}{n}}$.
Denote S n = cxp(-2np n el) = ex p(- ^M^Elgjf 1 ) - (2n(l - p„) + 1)" 



4p„ V c 
" (l-p„)(l/3-2p„) 



In conclusion, if $\delta > \delta_n$, we have:

$f(n, p_n, \delta) = \sqrt{\frac{\ln(1/\delta)}{2np_n}}$

and if $\delta < \delta_n$,



AV C H2n{l-p n ) + !)/(! - Pn ) + Ml/6) V c H2n+l)+Hl/S) 
f(n,p n ,6) - 3\/ < 6 J . 



□ 

In summary, the probability of the deviation between the out-of-bag cross-validation estimate and
the generalization error is bounded by the minimum of a Hoeffding-type bound and a
Vapnik-Chervonenkis-type bound, and thus it is smaller than 1 even for small learning sets. Finally, we
also give a simple rule on how to subag the predictor. However, in the case of classification, we only show
that subagging strong learners can give a strong learner. It would be more interesting to answer the
following question: can we obtain a similar result with the subagging of weak learners?

6 Appendices

We will use the definition of strong difference bounded introduced by [KUT02] and a corollary of his
main theorem inspired by [McD89].

Definition 35 (Kutin [KUT02]) Let $\Omega_1, \dots, \Omega_n$ be probability spaces. Let $\Omega = \prod_{k=1}^{n} \Omega_k$ and let $X$ be
a random variable on $\Omega$. We say that $X$ is strongly difference bounded by $(b, c, \delta)$ if the following
holds: there is a "bad" subset $B \subset \Omega$, where $\delta = P(B)$. If $\omega, \omega' \in \Omega$ differ only in the $k$-th coordinate, and
$\omega \notin B$, then

$|X(\omega) - X(\omega')| \le c$.

Furthermore, for any $\omega, \omega' \in \Omega$,

$|X(\omega) - X(\omega')| \le b$.

We will need the following theorem. It says in substance that a strongly difference bounded function
of independent variables is close to its expectation with high probability.

Theorem 36 (Kutin [KUT02]) Let $\Omega_1, \dots, \Omega_n$ be probability spaces. Let $\Omega = \prod_{k=1}^{n} \Omega_k$ and let $X$ be a
random variable on $\Omega$ which is strongly difference bounded by $(b, c, \delta)$. Assume $b \ge c \ge 0$ and $\alpha > 0$.
Let $\mu = E(X)$. Then, for any $\tau > 0$,

$\Pr(X - \mu \ge \tau) \le 2\Big(\exp\Big(-\frac{\tau^2}{8n(c + b\alpha)^2}\Big) + \frac{n}{\alpha}\delta\Big)$.

We will use the definition of weak difference bounded introduced by [KUT02] and a corollary of his
main theorem.

Definition 37 (Kutin) Let $\Omega_1, \dots, \Omega_n$ be probability spaces. Let $\Omega = \prod_{k=1}^{n} \Omega_k$ and let $X$ be a random
variable on $\Omega$. We say that $X$ is weakly difference bounded by $(b, c, \delta)$ if the following holds: for any
$k$,

$\forall^{\delta} (\omega, v) \in \Omega \times \Omega_k,\quad |X(\omega) - X(\omega')| \le c$,

where $\omega'_k = v$ and $\omega'_i = \omega_i$ for $i \ne k$, and the notation $\forall^{\delta} \omega, \Phi(\omega)$ means "$\Phi(\omega)$ holds for all but a
$\delta$ fraction of $\Omega$".
Furthermore, for any $\omega, \omega' \in \Omega$ differing only in one coordinate:

$|X(\omega) - X(\omega')| \le b$.

We will need the following theorem. It says in substance that a weakly difference bounded function
of independent variables is close to its expectation with high probability.

Theorem 38 (Kutin) Let $\Omega_1, \dots, \Omega_n$ be probability spaces. Let $\Omega = \prod_{k=1}^{n} \Omega_k$ and let $X$ be a random
variable on $\Omega$ which is weakly difference bounded by $(b, c, \delta)$. Assume $b \ge c \ge 0$ and $\alpha > 0$. Let
$\mu = E(X)$. Then, for any $\varepsilon > 0$,

^ ,, £ 2 v 2nbS 1 / 2 . eb „ cl /9 

Pr(|* - M | > e) < 2 exp(- 10nc2(1 + i|b)2 ) + — exp(^)) + ^ 